SLIRP refactor — lift passt patterns into our smoltcp stack (Phases 0–5 + 5.5b) #68
Conversation
Adds two planning docs under docs/superpowers/plans/:

- 2026-04-27-smoltcp-passt-port.md (spec)
  Supersedes the 2026-04-12 network-backend-abstraction design. Replaces "add passt as opt-in backend" with "lift passt's design patterns into our smoltcp stack" — keeps observability, the all-Rust path, the single binary, and cross-platform parity. Lists the required skills for execution (rust-style, rustdoc, rust-analyzer-ssr, superpowers TDD/verification, repo verify/profile). Maps the work into 5+1 phases with per-phase plan-doc placeholders.
- 2026-04-27-smoltcp-passt-port-phase0.md (Phase 0 plan)
  25 bite-sized TDD tasks: correctness baseline pins, divan microbenches, a wall-clock e2e harness, NetworkBackend trait extraction, and the SlirpStack → SmoltcpBackend rename. Includes three BROKEN_ON_PURPOSE assertions that flip in later phases.
Add two baseline tests for the smoltcp DNS proxy:

- dns_query_resolves: sends a query for example.com, polls ≤20×100ms, asserts the reply XID matches.
- dns_cache_keys_by_question_not_xid: warms the cache with xid=1, then queries with xid=2 and asserts the stack rewrites the reply XID.

Both tests skip gracefully (eprintln + early return) when the upstream resolver is unreachable, making them safe in offline CI. Also adds a QNAME_EXAMPLE_COM const and two module-scope helpers: build_dns_query (builds a correct UDP DNS frame with proper payload_len) and parse_dns_reply_xid. SLIRP_DNS_IP is added to the existing module-scope slirp import.
Implement measure_tcp_throughput_g2h: binds a host-side TCP listener, boots a VM, execs dd|nc in the guest, drains to EOF on the host, and computes Mbps from bytes_received / elapsed. h2g left None with a TODO.
Implements measure_rr_latency and measure_crr_latency in voidbox-network-bench, reusing the single shared VM booted for throughput measurements. RR: guest pipes N bytes over one persistent nc connection; host times each read+write pair (first sample discarded to absorb connect jitter). CRR: guest runs N independent nc invocations; host times each full accept+read+write+close cycle. Both use the existing percentile() helper (dead_code attribute removed). Latency measurements always run regardless of --no-throughput.
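The RR/CRR measurements both lean on the shared percentile() helper for p50/p99. A minimal nearest-rank sketch of such a helper follows — the harness's actual implementation (interpolation, signature) may differ, so treat the name and shape here as assumptions:

```rust
use std::time::Duration;

// Hypothetical nearest-rank percentile, standing in for the harness's
// percentile() helper. Sorts in place; p is in [0, 100].
fn percentile(samples: &mut Vec<Duration>, p: f64) -> Option<Duration> {
    if samples.is_empty() {
        return None;
    }
    samples.sort();
    // Nearest-rank: ceil(p/100 * n), 1-indexed, clamped into range.
    let rank = ((p / 100.0) * samples.len() as f64).ceil() as usize;
    let idx = rank.max(1).min(samples.len()) - 1;
    Some(samples[idx])
}
```

With 100 samples of 1..=100 ms this yields 50 ms at p50 and 99 ms at p99, matching the nearest-rank convention.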
Per user feedback: "Slirp" denotes the user-mode-NAT role; "smoltcp" is the underlying library. Role-based naming keeps the public type surface stable across library swaps and matches the symmetry of future TapBackend / VhostNetBackend siblings. Module file src/network/slirp.rs keeps its name (already aligned with the new type, matches src/devices/virtio_net.rs convention).
The actual polling logic now lives in drain_to_guest, which writes
directly into the caller-supplied &mut Vec<Vec<u8>> buffer — no fresh
allocation on every tick. poll becomes a #[deprecated] shim:
#[deprecated(note = "use drain_to_guest")]
pub fn poll(&mut self) -> Vec<Vec<u8>> {
    let mut out = Vec::new();
    self.drain_to_guest(&mut out);
    out
}
Existing call sites (virtio_net.rs, tests/network_baseline.rs,
benches/network.rs) are annotated with #[allow(deprecated)] and a
TODO(0D.4/0D.5) marker. They will be migrated in the next two tasks,
after which the allow attributes can be removed.
Switch VirtioNetDevice::slirp from Arc<Mutex<SlirpStack>> to Arc<Mutex<dyn NetworkBackend>>, replacing the deprecated poll() call in get_rx_frames with drain_to_guest into a reused rx_scratch buffer. Update both VMM cold-boot and snapshot-restore construction sites to coerce Arc<Mutex<SlirpStack>> to the trait object. All 14 baseline tests pass; fmt and clippy clean.
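The trait-object switch described above can be sketched as follows. The `NetworkBackend` and `drain_to_guest` names come from this PR; the struct body and the `drain_one_tick` helper are illustrative stand-ins, not the real virtio_net.rs code:

```rust
use std::sync::{Arc, Mutex};

// Hedged sketch of the PR's NetworkBackend trait (signature assumed).
trait NetworkBackend: Send {
    // Append pending guest-bound frames to the caller-owned buffer.
    fn drain_to_guest(&mut self, out: &mut Vec<Vec<u8>>);
}

// Stand-in for SlirpBackend; the real type wraps a smoltcp stack.
struct SlirpBackend {
    pending: Vec<Vec<u8>>,
}

impl NetworkBackend for SlirpBackend {
    fn drain_to_guest(&mut self, out: &mut Vec<Vec<u8>>) {
        out.append(&mut self.pending);
    }
}

// Mirrors the VMM construction sites: coerce the concrete backend into
// Arc<Mutex<dyn NetworkBackend>> so the device is backend-agnostic.
fn drain_one_tick(slirp: Arc<Mutex<SlirpBackend>>, rx_scratch: &mut Vec<Vec<u8>>) -> usize {
    let backend: Arc<Mutex<dyn NetworkBackend>> = slirp; // unsized coercion
    backend.lock().unwrap().drain_to_guest(rx_scratch);
    rx_scratch.len()
}
```

The coercion works because `Mutex<SlirpBackend>` unsizes to `Mutex<dyn NetworkBackend>`, so no wrapper type is needed at either construction site.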
Type rename only — the slirp.rs module file keeps its name. SlirpBackend reflects the user-mode-NAT role rather than the underlying smoltcp library, keeping naming symmetric with future TapBackend / VhostNetBackend siblings.
Introduces the types and helper needed for ICMP echo NAT (Phase 1):
- IcmpEchoKey {guest_id, dst_ip}: hash key for the echo NAT table.
- IcmpEchoEntry {sock, guest_id, last_activity}: per-request state.
- open_icmp_socket(): opens SOCK_DGRAM/IPPROTO_ICMP (no CAP_NET_RAW).
- icmp_echo: HashMap<IcmpEchoKey, IcmpEchoEntry> field on SlirpBackend,
initialized to HashMap::new() in with_security() (the canonical ctor;
new() and Default both delegate through it).
No behavior change — handle_ipv4_frame is untouched, the map stays
empty. Dead-code allowances are scoped to the new items and will be
removed once tasks 1.2/1.3 wire them in.
Add the SynSent state to TcpNatState for host-initiated (port-forward) connections. When handle_tcp_frame sees SYN+ACK on a SynSent entry it sends an ACK to the guest, advances our_seq, records guest_ack, and transitions to Established — completing the inbound 3-way handshake.

Add #[cfg(test)] helpers on SlirpBackend (insert_synthetic_synsent_entry, tcp_flow_state, injected_plain_ack_count) and a unit test tcp_inbound_syn_ack_completes_handshake that seeds a SynSent entry, feeds a guest SYN-ACK, and asserts (a) state → Established and (b) one plain ACK queued for injection. The full E2E contract is deferred to task 5.5b.5 (tcp_port_forward_inbound in tests/network_baseline.rs).

build_tcp_packet_static signature: (src_ip, dst_ip, src_port, dst_port, seq, ack, control, payload). The inbound ACK uses src=SLIRP_GATEWAY_IP, dst=SLIRP_GUEST_IP, src_port=key.dst_port (high port), dst_port=key.guest_src_port.
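The SynSent → Established arm reduces to a small state machine. The sketch below uses illustrative field names (`Flow`, `injected_plain_acks`) rather than the real TcpNatState entry, but the sequence-number arithmetic follows the description above:

```rust
// Hypothetical reduction of the inbound-handshake transition; field and
// flag names are stand-ins for the real slirp.rs types.
#[derive(Debug, PartialEq)]
enum TcpNatState {
    SynSent,
    Established,
}

struct Flow {
    state: TcpNatState,
    our_seq: u32,       // seq of the SYN we synthesized toward the guest
    guest_ack: u32,     // next guest byte we will ACK
    injected_plain_acks: u32,
}

impl Flow {
    /// Guest segment on a host-initiated flow: SYN+ACK while in SynSent
    /// completes the 3-way handshake and queues one plain ACK.
    fn on_guest_segment(&mut self, syn: bool, ack: bool, seg_seq: u32, seg_ack: u32) {
        if self.state == TcpNatState::SynSent
            && syn
            && ack
            && seg_ack == self.our_seq.wrapping_add(1)
        {
            self.our_seq = self.our_seq.wrapping_add(1); // our SYN consumed one seq
            self.guest_ack = seg_seq.wrapping_add(1);    // ACK the guest's SYN
            self.injected_plain_acks += 1;               // queue the plain ACK
            self.state = TcpNatState::Established;
        }
    }
}
```

This is the same contract the unit test pins: one transition, one queued ACK, correct sequence bookkeeping.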
Adds a divan bench for the SynSent → Established state-machine path introduced in 5.5b.1. The bench seeds one synthetic SynSent entry, feeds a SYN-ACK frame to process_guest_frame, and measures the transition cost (~42 µs median, same order as process_syn). Approach (option a): widen the three #[cfg(test)] helpers on SlirpBackend to #[cfg(any(test, feature = "bench-helpers"))]. insert_synthetic_synsent_entry is promoted to `pub` within the gated impl block so the bench binary (a separate compilation unit) can call it. The feature is never enabled in production builds. All helpers in benches/network.rs that are only needed under bench-helpers are gated with #[cfg(feature = "bench-helpers")] to keep the default bench binary warning-free.
Pull request overview
Copilot reviewed 20 out of 20 changed files in this pull request and generated 5 comments.
    let mut table: HashMap<usize, u32> = HashMap::with_capacity(n);
    // Insert phase
    for (i, _key) in keys.iter().enumerate() {
        table.insert(i, i as u32);
    }
    // Remove phase
    for i in 0..n {
        divan::black_box(table.remove(&i));
flow_table_insert_remove pre-builds a keys vector but never uses it; the bench actually measures HashMap<usize, u32> insert/remove with integer keys. That doesn’t reflect the stated intent (unified flow-table key hashing/dispatch) and makes the result hard to interpret. Either remove the unused keys construction or use a key type representative of the real flow table (e.g., the actual FlowKey or a synthetic struct with similar hashing cost).
Suggested change:

-    let mut table: HashMap<usize, u32> = HashMap::with_capacity(n);
-    // Insert phase
-    for (i, _key) in keys.iter().enumerate() {
-        table.insert(i, i as u32);
-    }
-    // Remove phase
-    for i in 0..n {
-        divan::black_box(table.remove(&i));
+    let mut table: HashMap<smoltcp::wire::IpAddress, u32> = HashMap::with_capacity(n);
+    // Insert phase
+    for (i, key) in keys.iter().copied().enumerate() {
+        table.insert(key, i as u32);
+    }
+    // Remove phase
+    for key in keys.iter() {
+        divan::black_box(table.remove(key));
    // Drain backend frames into the reused scratch buffer.
    self.rx_scratch.clear();
    {
        let mut backend = self.slirp.lock().unwrap();
        backend.drain_to_guest(&mut self.rx_scratch);
    }
    let frames = std::mem::take(&mut self.rx_scratch);
get_rx_frames clears rx_scratch, fills it, then mem::takes it. That replaces the field with a fresh empty Vec (capacity 0), so the next poll re-allocates and the scratch buffer is not actually reused as the comment claims. Consider draining from rx_scratch into frames/result (e.g., drain(..)), or swapping while preserving capacity, so the allocation is retained across calls.
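The reviewer's point can be shown in isolation: `mem::take` swaps in a brand-new `Vec` with capacity 0, while `drain(..)` empties the buffer but keeps its allocation. The functions below are illustrative stand-ins, not the PR's actual `get_rx_frames`:

```rust
// `scratch` stands in for VirtioNetDevice::rx_scratch.
fn take_loses_capacity(scratch: &mut Vec<Vec<u8>>) -> Vec<Vec<u8>> {
    // After this call, `scratch` is a fresh Vec with capacity 0,
    // so the next poll re-allocates from scratch (pun intended).
    std::mem::take(scratch)
}

fn drain_keeps_capacity(scratch: &mut Vec<Vec<u8>>) -> Vec<Vec<u8>> {
    // `scratch` ends empty but retains its allocation across calls.
    scratch.drain(..).collect()
}
```

The `drain(..)` variant still allocates the returned `Vec`, but the scratch buffer's capacity survives, which is what the comment in `get_rx_frames` promises.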
/// Unlike `build_udp_frame` (which carries a pre-existing off-by-one in
/// the `payload_len` argument passed to `udp_repr.emit`), this helper
/// passes only the DNS payload length so the UDP `len` field is correct
/// and the stack's smoltcp parser accepts the frame.
The comment claims build_udp_frame has an off-by-one bug in the payload_len passed to udp_repr.emit, but build_udp_frame currently passes payload.len() (which is the correct smoltcp API expectation). This mismatch is likely to confuse future refactors—either fix the comment or adjust build_udp_frame if it’s actually wrong.
Suggested change:

- /// Unlike `build_udp_frame` (which carries a pre-existing off-by-one in
- /// the `payload_len` argument passed to `udp_repr.emit`), this helper
- /// passes only the DNS payload length so the UDP `len` field is correct
- /// and the stack's smoltcp parser accepts the frame.
+ /// This helper passes the DNS payload length expected by `udp_repr.emit`,
+ /// so the UDP `len` field is correct and the stack's smoltcp parser
+ /// accepts the frame.
    unsafe {
        libc::setsockopt(
            listener.as_raw_fd(),
            libc::SOL_SOCKET,
            libc::SO_RCVBUF,
            &val as *const libc::c_int as *const libc::c_void,
            std::mem::size_of::<libc::c_int>() as libc::socklen_t,
        );
The setsockopt(SO_RCVBUF) call’s return value is ignored. If this fails (e.g., unsupported or restricted), the test won’t actually exercise the intended backpressure scenario but will still pass/fail based on unrelated behavior. Please check the return code (and/or last_os_error) and either assert or skip the test when the option can’t be applied.
Suggested change:

-    unsafe {
-        libc::setsockopt(
-            listener.as_raw_fd(),
-            libc::SOL_SOCKET,
-            libc::SO_RCVBUF,
-            &val as *const libc::c_int as *const libc::c_void,
-            std::mem::size_of::<libc::c_int>() as libc::socklen_t,
-        );
+    let rc = unsafe {
+        libc::setsockopt(
+            listener.as_raw_fd(),
+            libc::SOL_SOCKET,
+            libc::SO_RCVBUF,
+            &val as *const libc::c_int as *const libc::c_void,
+            std::mem::size_of::<libc::c_int>() as libc::socklen_t,
+        )
+    };
+    if rc != 0 {
+        eprintln!(
+            "skipping tcp_writes_more_than_256kb_succeed: failed to set SO_RCVBUF: {}",
+            std::io::Error::last_os_error()
+        );
+        return;
Adds the host listener thread infrastructure for Phase 5.5b inbound TCP port-forwarding. No listener is spawned yet (that is task 5.5b.4), so there is no behavior change in this commit.

New items in src/network/slirp.rs:
- `InboundAccept` struct (pub(crate)) — channel payload from listener to net-poll
- `port_forward_listeners: Vec<JoinHandle<()>>` field — holds spawn handles
- `port_forward_shutdown: Arc<AtomicBool>` field — graceful-shutdown signal
- `pending_inbound_accepts: mpsc::Receiver<InboundAccept>` field — accept channel rx
- `accept_sender: mpsc::Sender<InboundAccept>` field — keeps channel open + test helper
- `process_pending_inbound_accepts()` — drains channel, inserts SynSent entries, queues SYNs
- `run_port_forward_listener()` — module-scope thread fn, nonblocking accept loop
- `spawn_port_forward_listeners()` — pub(crate) factory, not called until 5.5b.4
- `Drop for SlirpBackend` — sets shutdown flag, joins all listener handles
- `push_inbound_accept()` on the test-only impl — injects accepts for unit tests
- `drain_to_guest` now calls `process_pending_inbound_accepts()` as step 0

TDD: the test `process_pending_inbound_accepts_seeds_synsent_and_queues_syn` was written and watched fail before implementation; it passes in GREEN. 17/17 network_baseline integration tests unchanged.
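The listener-to-net-poll handoff above can be sketched as a non-blocking channel drain. `InboundAccept` is the PR's name; the struct body and the seeded-count return are illustrative stand-ins for the real SynSent-table insertion:

```rust
use std::sync::mpsc;

// Stand-in for the PR's channel payload; the real struct carries the
// accepted TcpStream plus forwarding-rule metadata.
struct InboundAccept {
    host_port: u16,
}

/// Non-blocking drain of the listener → net-poll channel, run as step 0 of
/// drain_to_guest. Each queued accept seeds one SynSent entry (modeled here
/// as a count); try_recv never blocks the poll loop.
fn process_pending_inbound_accepts(rx: &mpsc::Receiver<InboundAccept>) -> usize {
    let mut seeded = 0;
    while let Ok(_accept) = rx.try_recv() {
        // Real code: insert a SynSent flow entry and queue a synthesized
        // SYN toward the guest for this accept.
        seeded += 1;
    }
    seeded
}
```

Keeping the `Sender` alive on the backend (the `accept_sender` field) is what prevents `try_recv` from returning `Disconnected` when no listener threads exist, e.g. in unit tests.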
Call `spawn_port_forward_listeners` in `SlirpBackend::with_security` so host listener threads are actually spawned when `nat.port_forwards` is non-empty. The function now also returns the `Sender` end of the accept channel so `accept_sender` (needed to keep the channel open in tests) is sourced from the same channel pair as `pending_inbound_accepts`. Remove the `#[allow(dead_code)]` attrs from both functions. Unit test `with_security_spawns_listener_per_tcp_port_forward` confirms zero threads for empty rules and one thread per TCP rule.
Adds tcp_port_forward_inbound_connect_succeeds to tests/network_baseline.rs. The test builds a SlirpBackend with one port-forward rule (18080→8080), drives drain_to_guest in a loop while a host thread connects to 127.0.0.1:18080, synthesizes a guest listener by responding with SYN-ACK to the SYN the stack emits, and asserts three contract points:

1. host TcpStream::connect succeeds — the listener thread (5.5b.3) is alive.
2. drain_to_guest emits a synthesized SYN to GUEST_PORT — the InboundAccept channel + process_pending_inbound_accepts + synthesize_inbound_syn (5.5b.2/5.5b.3/5.5b.4) all fired.
3. drain_to_guest emits the completing ACK after our SYN-ACK — the SynSent → Established arm (5.5b.1) fired.

Also adds the parse_tcp_to_guest_full helper (a superset of parse_tcp_to_guest that also returns src/dst ports, needed to identify the ephemeral high_port in the synthesized SYN). No VM, no --ignored flag, completes in ~0.1 s.
…aseline

Times the full inbound port-forward path: host TcpStream::connect → listener thread accept() → mpsc channel → process_pending_inbound_accepts → synthesize_inbound_syn → drain_to_guest output. Bounded above by PORT_FORWARD_POLL_INTERVAL (50 ms). Regressions in the inbound state machine or listener poll loop now surface numerically against this baseline.
Pull request overview
Copilot reviewed 20 out of 20 changed files in this pull request and generated 3 comments.
    // Push 1 MB in 1 KB chunks. Drain after every batch so the
    // host's read thread can drain the kernel buffer and ACKs flow
    // back to the guest. The new TCP-backpressure path means some
    // chunks won't be ACK'd immediately; we re-send those (TCP-style
    // retransmit) until they go through.
    const TOTAL: usize = 1024 * 1024;
    const CHUNK: usize = 1024;
    let chunk = vec![b'x'; CHUNK];
    let mut seq = 1001u32;
    let mut acked_seq = 1001u32;
    let mut saw_close = false;
    let deadline = std::time::Instant::now() + std::time::Duration::from_secs(10);

    while bytes_received.load(Ordering::Relaxed) < TOTAL && std::time::Instant::now() < deadline {
        // Send a chunk; advance our seq.
        let _ = stack.process_guest_frame(&build_tcp_frame(
            SLIRP_GATEWAY_IP,
            GUEST_EPHEMERAL_PORT,
            host_port,
            seq,
            our_seq + 1,
            TcpControl::Psh,
            &chunk,
        ));
        seq = seq.wrapping_add(CHUNK as u32);
In tcp_writes_more_than_256kb_succeed, the loop unconditionally increments seq after each send, but it never actually retransmits when the stack stops ACKing (backpressure). This contradicts the test’s own comment (“we re-send those”) and can make the test either flaky or fail to validate the intended behavior. Consider only advancing seq once the corresponding ACK is observed (or explicitly resend the same seq/payload when acked_seq hasn’t advanced).
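The reviewer's suggested fix — advance `seq` only once the chunk is ACKed, otherwise resend the same chunk — can be sketched as follows. `try_send` is a hypothetical stand-in for feeding `process_guest_frame` and reading back the stack's cumulative ACK; the constant names mirror the test:

```rust
/// Resend-until-ACKed sender loop. `try_send(seq)` sends the chunk starting
/// at `seq` and returns the cumulative acked_seq the stack reports.
fn send_with_retransmit<F>(total_chunks: u32, chunk_len: u32, mut try_send: F) -> u32
where
    F: FnMut(u32) -> u32,
{
    let mut seq = 1001u32;
    let end = 1001u32.wrapping_add(total_chunks * chunk_len);
    let mut attempts = 0u32;
    while seq != end {
        let acked = try_send(seq);
        if acked >= seq.wrapping_add(chunk_len) {
            // Chunk fully ACKed: only now advance our seq.
            seq = seq.wrapping_add(chunk_len);
        }
        // Otherwise loop and resend the same seq/payload (TCP-style
        // retransmit). The guard stands in for the test's 10 s deadline.
        attempts += 1;
        assert!(attempts < 1000, "deadline stand-in");
    }
    seq
}
```

Under backpressure this sends each unACKed chunk again at the same sequence number instead of racing ahead, which is what makes the test actually exercise the new path.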
        exit_code = ?output.exit_code,
        stderr = output.stderr_str(),
        "g2h iteration non-zero exit; skipping"
    );
In measure_tcp_throughput_g2h, if the guest nc command exits non-zero you log a warning but still proceed to read the drain-thread result and may include that iteration in the mean. This can silently skew the reported throughput downward (or mask real failures). Consider continue-ing on !output.success() so only successful iterations contribute to the metric (and similarly for the bulk throughput path).
Suggested change:

-    );
+    );
+    continue;
Compares HEAD against an arbitrary baseline ref using the two bench harnesses: divan microbenches (cargo bench --bench network) and the VM-backed wall-clock harness (voidbox-network-bench). Emits markdown with absolute numbers + percent deltas, suitable for PR descriptions. Replaces the scattered /tmp/baseline-network-phase*.json files with a reproducible single entry-point.
Pull request overview
Copilot reviewed 21 out of 21 changed files in this pull request and generated 6 comments.
    // The host accepts CRR_SAMPLES_PER_ITER connections, times each cycle,
    // and sends results back over a channel.
    let (crr_tx, crr_rx) = mpsc::channel::<Vec<Duration>>();
    let sample_count = CRR_SAMPLES_PER_ITER;

    std::thread::spawn(move || {
        let samples = crr_echo_server(&listener, sample_count);
        let _ = crr_tx.send(samples);
    });
Same accept-thread leak pattern in CRR: the spawned thread blocks on listener.accept() and can persist past the iteration if the guest loop errors or no connections arrive, leaking resources and potentially stalling later runs. Use a bounded accept loop with timeout/cancellation, or accept in the main async task via a non-blocking listener.
parse_divan() {
    local file="$1"
    LC_ALL=en_US.UTF-8 awk -F'│' '
LC_ALL=en_US.UTF-8 is not guaranteed to exist on minimal distros/CI images, which can make parse_divan fail even when awk is present. Consider using a more universally available UTF-8 locale like C.UTF-8 (with a fallback to C) or avoid locale-dependence by normalizing the µs unit in the input before parsing.
Suggested change:

-    LC_ALL=en_US.UTF-8 awk -F'│' '
+    local awk_locale="C"
+    if command -v locale >/dev/null 2>&1; then
+        if locale -a 2>/dev/null | grep -Fxq 'C.UTF-8'; then
+            awk_locale='C.UTF-8'
+        elif locale -a 2>/dev/null | grep -Fxq 'C.utf8'; then
+            awk_locale='C.utf8'
+        elif locale -a 2>/dev/null | grep -Fxq 'en_US.UTF-8'; then
+            awk_locale='en_US.UTF-8'
+        fi
+    fi
+    LC_ALL="$awk_locale" awk -F'│' '
    let (drain_tx, drain_rx) = mpsc::channel::<(u64, Duration)>();
    std::thread::spawn(move || {
        let drain_result = drain_one_connection(&listener);
        let _ = drain_tx.send(drain_result);
    });
Same accept-thread leak pattern as measure_tcp_throughput_g2h: the spawned thread blocks on listener.accept() and can outlive the iteration on exec failure or recv_timeout, accumulating stuck threads and open FDs. Use a bounded accept with timeout/cancellation so the harness remains stable under failure conditions.
    let (echo_tx, echo_rx) = mpsc::channel::<Vec<Duration>>();

    std::thread::spawn(move || {
        let samples = rr_echo_server(&listener, RR_SAMPLES_PER_ITER);
        let _ = echo_tx.send(samples);
    });
Same issue here: the RR echo server thread blocks on listener.accept() and isn’t joined/cancelled. If the guest nc fails to connect, the thread can hang indefinitely holding the listener, which can leak threads across iterations. Prefer non-blocking accept with a deadline, or a cancellable blocking task.
Reviewer (Copilot) flagged that the previous skip-on-no-reply path masked real ICMP regressions on hosts where unprivileged ICMP works. Probe socket(AF_INET, SOCK_DGRAM, IPPROTO_ICMP) once: skip only on EPERM/EACCES (sysctl net.ipv4.ping_group_range forbids it), assert otherwise.
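The skip/assert decision rule reduces to errno classification on the probe result. The sketch below models only that rule (the enum and function names are illustrative); on Linux, EPERM is errno 1 and EACCES is errno 13:

```rust
use std::io;

// Sketch of the probe policy: socket(AF_INET, SOCK_DGRAM, IPPROTO_ICMP)
// is attempted once, and the outcome is classified as follows.
#[derive(Debug, PartialEq)]
enum IcmpProbe {
    Usable,        // socket() succeeded: run the test and assert on replies
    SkipForbidden, // EPERM/EACCES: net.ipv4.ping_group_range excludes us
    HardFail,      // any other error is a real regression, not an env fact
}

fn classify_icmp_probe(result: Result<(), io::Error>) -> IcmpProbe {
    match result {
        Ok(()) => IcmpProbe::Usable,
        Err(e) => match e.raw_os_error() {
            Some(1) | Some(13) => IcmpProbe::SkipForbidden, // EPERM / EACCES
            _ => IcmpProbe::HardFail,
        },
    }
}
```

Classifying only these two errnos as skippable is what stops the test from masking regressions on hosts where unprivileged ICMP actually works.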
C1.3: measure_tcp_throughput_g2h and measure_bulk_throughput_g2h now `continue` on guest nc non-zero exit so failed iterations don't skew the reported mean. C2.2: measure_icmp_rr_latency dropped its guest-side ping path. The guest images intentionally omit /bin/ping (busybox-static lacks CONFIG_FEATURE_PING_TYPE_DGRAM and SOCK_RAW would need root); the function now returns None with a warn explaining the gap. Proper host-driven measurement is tracked as a follow-up.
Aligns benches with the production RX path. The deprecated poll() allocated a fresh Vec<Vec<u8>> per call; drain_to_guest appends to a caller-owned buffer that's reused across iterations. CI now gates on the same allocator pattern production code uses, removing avoidable allocation overhead from the measurements. Drops the file-level #![allow(deprecated)].
When the baseline ref pre-dates a464fc1 (Phase 5.5b.1) it doesn't have the `bench-helpers` cargo feature, and cargo errored out, dropping the entire baseline divan section to —. Detect the "does not have feature" / "unknown feature" stderr signal and retry without --features bench-helpers. Benches that exist at both refs get real Δ%; the bench-helpers-gated ones (synthesize_inbound_syn, tcp_inbound_syn_ack_transition) naturally remain — for baseline. Unlocks bench-compare.sh --baseline origin/main against the full branch history.
drain_one_connection, rr_echo_server, crr_echo_server now accept with a deadline derived from LATENCY_RECV_TIMEOUT + slack. Previously each spawned thread blocked forever on listener.accept() if the guest nc never connected (exec error, network failure), holding the listener FD across all subsequent iterations and burning thread/FD slots. When the accept deadline lapses, the thread exits cleanly, the listener drops, and the next iteration starts with a clean slate. Addresses Copilot review C2.1, C2.5, C2.6.
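The deadline-bounded accept can be sketched with a non-blocking listener; the function name and sleep interval are illustrative, and the real helpers derive the duration from LATENCY_RECV_TIMEOUT plus slack:

```rust
use std::net::{TcpListener, TcpStream};
use std::time::{Duration, Instant};

/// Accept with a deadline instead of blocking forever. When no peer connects
/// before `deadline` lapses, return None so the thread exits cleanly and the
/// listener FD is dropped, leaving the next iteration a clean slate.
fn accept_with_deadline(listener: &TcpListener, deadline: Duration) -> Option<TcpStream> {
    listener.set_nonblocking(true).ok()?;
    let end = Instant::now() + deadline;
    loop {
        match listener.accept() {
            Ok((stream, _peer)) => return Some(stream),
            Err(e) if e.kind() == std::io::ErrorKind::WouldBlock => {
                if Instant::now() >= end {
                    return None; // deadline lapsed: give up, free the FD
                }
                std::thread::sleep(Duration::from_millis(5));
            }
            Err(_) => return None,
        }
    }
}
```

The same pattern fits all three helpers: the thread either hands back an accepted stream within the window or terminates, so a failed guest `nc` can no longer strand an accept thread across iterations.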
… + epoll
Scopes the four architectural follow-ups surfaced in the
smoltcp-passt-port PR review:
- 6.1 (high) : TCP half-close (FinWait*/CloseWait/LastAck) — silent
data loss on shutdown(SHUT_WR) today.
- 6.2 (med-h) : Async outbound connect — vCPU thread blocked up to
3 s on slow destinations.
- 6.3 (med) : Window management + scaling — guest window ignored;
advertised window hardcoded 65535.
- 6.4 (med-l) : Event-driven RX polling — replace 5 ms timer with
epoll_wait.
Locks the observability + cross-platform + snapshot invariants from
the top-level spec. Per-subsystem TDD task lists deferred to dedicated
plans (-phase6.1.md..-phase6.4.md) written before each kicks off.
…ghput vs main) (#69) * docs(plans): smoltcp passt-pattern port — spec + Phase 0 plan Adds two planning docs under docs/superpowers/plans/: - 2026-04-27-smoltcp-passt-port.md (spec) Supersedes the 2026-04-12 network-backend-abstraction design. Replaces "add passt as opt-in backend" with "lift passt's design patterns into our smoltcp stack" — keeps observability, all-Rust path, single binary, cross-platform parity. Lists required skills for execution (rust-style, rustdoc, rust-analyzer-ssr, superpowers TDD/verification, repo verify/profile). Maps the work into 5+1 phases with per-phase plan-doc placeholders. - 2026-04-27-smoltcp-passt-port-phase0.md (Phase 0 plan) 25 bite-sized TDD tasks: correctness baseline pins, divan microbenches, wall-clock e2e harness, NetworkBackend trait extraction, SlirpStack → SmoltcpBackend rename. Includes three BROKEN_ON_PURPOSE assertions that flip in later phases. * test(network): scaffold network_baseline pins with frame helpers * test(network): address review — restore reserved constants, alias IpAddress * test(network): pin TCP handshake SYN-ACK emission * test(network): pin TCP guest↔host data round-trip * test(network): BROKEN_ON_PURPOSE pin — 256 KB to_host cliff * test(network): hoist inline `use` statements to module scope * test(network): pin TCP rate limit, concurrent cap, deny list * test(network): pin ARP reply behavior for gateway and subnet * test(network): pin DNS resolution and cache xid-rewrite Add two baseline tests for the smoltcp DNS proxy: - dns_query_resolves: sends a query for example.com, polls ≤20×100ms, asserts reply XID matches. - dns_cache_keys_by_question_not_xid: warms cache with xid=1, then queries with xid=2 and asserts the stack rewrites the reply XID. Both tests skip gracefully (eprintln + early return) when the upstream resolver is unreachable, making them safe in offline CI. 
Also adds QNAME_EXAMPLE_COM const and two module-scope helpers: build_dns_query (builds a correct UDP DNS frame with proper payload_len) and parse_dns_reply_xid. SLIRP_DNS_IP added to the existing module-scope slirp import. * test(network): fix build_udp_frame payload_len double-count * test(network): BROKEN_ON_PURPOSE pin — UDP non-DNS dropped * test(network): BROKEN_ON_PURPOSE pin — ICMP echo dropped * bench(network): divan microbenches for SLIRP hot paths * bench(network): parametric NAT-walk scaling at 1/100/1000 flows * bench(network): DNS cache hit and miss paths * ci(bench): include network microbenches in regression gate * bench(network): voidbox-network-bench binary scaffold * bench(network): TCP throughput via busybox nc Implement measure_tcp_throughput_g2h: binds a host-side TCP listener, boots a VM, execs dd|nc in the guest, drains to EOF on the host, and computes Mbps from bytes_received / elapsed. h2g left None with a TODO. * bench(network): TCP RR/CRR latency p50/p99 Implements measure_rr_latency and measure_crr_latency in voidbox-network-bench, reusing the single shared VM booted for throughput measurements. RR: guest pipes N bytes over one persistent nc connection; host times each read+write pair (first sample discarded to absorb connect jitter). CRR: guest runs N independent nc invocations; host times each full accept+read+write+close cycle. Both use the existing percentile() helper (dead_code attribute removed). Latency measurements always run regardless of --no-throughput. * bench(network): UDP DNS qps and JSON report output * docs(plans): rename SmoltcpBackend → SlirpBackend in spec + Phase 0 plan Per user feedback: "Slirp" denotes the user-mode-NAT role; "smoltcp" is the underlying library. Role-based naming keeps the public type surface stable across library swaps and matches the symmetry of future TapBackend / VhostNetBackend siblings. 
Module file src/network/slirp.rs keeps its name (already aligned with the new type, matches src/devices/virtio_net.rs convention). * feat(network): introduce NetworkBackend trait * refactor(slirp): add drain_to_guest wrapper for trait fit * refactor(slirp): move poll body into drain_to_guest, drop alloc The actual polling logic now lives in drain_to_guest, which writes directly into the caller-supplied &mut Vec<Vec<u8>> buffer — no fresh allocation on every tick. poll becomes a #[deprecated] shim: #[deprecated(note = "use drain_to_guest")] pub fn poll(&mut self) -> Vec<Vec<u8>> { let mut out = Vec::new(); self.drain_to_guest(&mut out); out } Existing call sites (virtio_net.rs, tests/network_baseline.rs, benches/network.rs) are annotated with #[allow(deprecated)] and a TODO(0D.4/0D.5) marker. They will be migrated in the next two tasks, after which the allow attributes can be removed. * feat(slirp): impl NetworkBackend for SlirpStack * refactor(virtio_net): hold dyn NetworkBackend, reuse rx buffer Switch VirtioNetDevice::slirp from Arc<Mutex<SlirpStack>> to Arc<Mutex<dyn NetworkBackend>>, replacing the deprecated poll() call in get_rx_frames with drain_to_guest into a reused rx_scratch buffer. Update both VMM cold-boot and snapshot-restore construction sites to coerce Arc<Mutex<SlirpStack>> to the trait object. All 14 baseline tests pass; fmt and clippy clean. * refactor(network): rename SlirpStack to SlirpBackend Type rename only — the slirp.rs module file keeps its name. SlirpBackend reflects the user-mode-NAT role rather than the underlying smoltcp library, keeping naming symmetric with future TapBackend / VhostNetBackend siblings. * docs(plans): add Phase 1 plan (ICMP echo via SOCK_DGRAM IPPROTO_ICMP) * feat(slirp): add IcmpEchoEntry + IPPROTO_ICMP socket helper Introduces the types and helper needed for ICMP echo NAT (Phase 1): - IcmpEchoKey {guest_id, dst_ip}: hash key for the echo NAT table. - IcmpEchoEntry {sock, guest_id, last_activity}: per-request state. 
- open_icmp_socket(): opens SOCK_DGRAM/IPPROTO_ICMP (no CAP_NET_RAW). - icmp_echo: HashMap<IcmpEchoKey, IcmpEchoEntry> field on SlirpBackend, initialized to HashMap::new() in with_security() (the canonical ctor; new() and Default both delegate through it). No behavior change — handle_ipv4_frame is untouched, the map stays empty. Dead-code allowances are scoped to the new items and will be removed once tasks 1.2/1.3 wire them in. * refactor(slirp): hoist FromRawFd to module scope, drop redundant use libc * feat(slirp): forward guest ICMP echo via SOCK_DGRAM IPPROTO_ICMP * feat(slirp): relay ICMP echo replies back to guest Add `relay_icmp_echo` which drains replies from each active ICMP echo socket and injects Ethernet/IPv4/ICMP echo-reply frames back into the guest. Add `build_icmp_echo_reply_to_guest` which parses the raw ICMP payload from the `SOCK_DGRAM IPPROTO_ICMP` socket, rewrites the ident back to the guest's original value, and builds a complete wire frame. Wire both into `drain_to_guest` immediately after `relay_tcp_nat_data`. Drop the now-stale `#[allow(dead_code)]` on `IcmpEchoEntry::guest_id` which is read by `relay_icmp_echo`. * feat(slirp): warn-once + fallback when unprivileged ICMP forbidden * test(network): flip ICMP pin — assert echo reply (was BROKEN_ON_PURPOSE) Renames `icmp_echo_silently_dropped` → `icmp_echo_returns_reply`. Targets 127.0.0.1 (loopback), polls 20 × 50ms for the reply, and skips via eprintln! if sysctl forbids unprivileged ICMP — consistent with how `dns_query_resolves` handles offline environments. * bench(network): populate ICMP RR latency p50 Add measure_icmp_rr_latency() to voidbox-network-bench. Runs busybox ping -c <count> -W 1 -i 0.05 8.8.8.8 inside the guest, parses time=<ms> fields, converts to microseconds, and returns the p50 median. Falls back to None + WARN on non-zero exit or empty parse (unreachable network). Wired into main after measure_dns_qps; always runs regardless of --no-throughput. 
Also: allow unprivileged ICMP sockets in guest-agent (ping_group_range) and add ping + setuid busybox to the test initramfs build. * fix(scripts): revert setuid busybox in test image (Phase 1.6 regression) Phase 1.6 (commit 8572122) added `chmod u+s "$OUT_DIR/bin/busybox"` to let busybox `ping` open SOCK_RAW. The unintended consequence: cpio is packed as the build user (uid 1000), so the kernel drops euid to 1000 on every execve from PID 1. In `guest-agent::setup_network`, that meant `ip link up`, `ip addr replace`, and `udhcpc` all silently failed with EPERM (no CAP_NET_ADMIN). The static-fallback loop wasted 10s of boot time. Combined with the vsock listener creation retry, total guest-agent startup exceeded the host's 30s control-channel handshake deadline → ECONNRESET on every connect → `voidbox-network-bench` and any test using `network(true)` failed with `control_channel: deadline reached`. Verification: - With setuid: bench fails consistently after 122 connect attempts. - Without setuid: bench produces clean numbers matching Phase 0 baseline (TCP RR p50=2µs, CRR p50=10176µs, DNS qps=0.5). The `ping` symlink is also dropped because busybox-static on Fedora is not built with CONFIG_FEATURE_PING_TYPE_DGRAM, so unprivileged ICMP is unavailable to the guest applet regardless. ICMP measurement in voidbox-network-bench now reports `null` cleanly ("ping: not found") until we route ICMP RR through SLIRP from the host instead. The companion `ping_group_range` write in guest-agent stays — it's harmless and supports future tooling that uses SOCK_DGRAM IPPROTO_ICMP. * docs(plans): add Phase 2 plan (generalize UDP via per-flow connected sockets) * feat(slirp): add UdpFlowEntry + per-flow connected socket helper * feat(slirp): forward non-DNS UDP via per-flow connected sockets Implement Task 2.2: route all guest UDP through handle_udp_frame, which creates/reuses a per-flow connected UdpSocket keyed on (guest_src_port, dst_ip, dst_port). 
DNS to SLIRP_DNS_IP still dispatches to the existing handle_dns_frame. SLIRP_GATEWAY_IP (10.0.2.2) is translated to 127.0.0.1 before connect(), matching the TCP NAT path. Drop #[allow(dead_code)] from UdpFlowEntry (item-level), open_udp_flow_socket, and the udp_flows field — all now consumed. Add a field-targeted #[allow(dead_code)] on last_activity (written here, read by Task 2.4). Flip the udp_non_dns_silently_dropped BROKEN_ON_PURPOSE pin: datagrams now reach the bound host socket, confirming the guest→host send path works. All 14 baseline tests pass. * ci(bench): add strict voidbox-network-bench step (no continue-on-error) Catches setuid-busybox-style regressions that masquerade as "environment flakes". Specifically: the bug fixed at 77dfc67 (Phase 1.6 added `chmod u+s busybox`, dropping PID 1 euid → no CAP_NET_ADMIN → setup_network silently fails → 30s handshake deadline expires → ECONNRESET) would have been visible in CI from the start if the wall-clock harness step weren't behind `continue-on-error: true`. This new step runs `voidbox-network-bench --iterations 3` and publishes the JSON metrics to the step summary. Failure of the harness fails the workflow — no masking. The existing `voidbox-startup-bench` step keeps `continue-on-error` for now because its warm-restore phase has a separate, unfixed issue (`control_channel[multiplex-establish]: deadline reached` reproducible on main); flipping that to strict belongs in the PR that fixes the warm-restore handshake. Vhost-vsock probe still gates the run via `/dev/vhost-vsock` existence check — runners without it skip cleanly with a warning, since absence-of-device is an environment fact, not a regression. * feat(slirp): relay UDP flow replies back to guest Add `relay_udp_flows` and `build_udp_reply_to_guest` to `SlirpBackend`. 
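The per-flow connected-socket pattern from Task 2.2 can be sketched with std types only. The key shape — one connected `UdpSocket` per (guest_src_port, dst_ip, dst_port) 3-tuple, created on first datagram and reused afterward — follows the commit text; the `UdpFlows` type and its fields are illustrative, not the real SlirpBackend API:

```rust
use std::collections::HashMap;
use std::net::{Ipv4Addr, SocketAddrV4, UdpSocket};

// Illustrative stand-in for the flow key described in Task 2.2.
type FlowKey3 = (u16, Ipv4Addr, u16); // (guest_src_port, dst_ip, dst_port)

struct UdpFlows {
    flows: HashMap<FlowKey3, UdpSocket>,
}

impl UdpFlows {
    fn send(&mut self, key: FlowKey3, payload: &[u8]) -> std::io::Result<usize> {
        let sock = match self.flows.entry(key) {
            std::collections::hash_map::Entry::Occupied(e) => e.into_mut(),
            std::collections::hash_map::Entry::Vacant(v) => {
                let s = UdpSocket::bind("0.0.0.0:0")?;
                // connect() pins the peer, so replies arrive via plain recv()
                // and the kernel filters out unrelated senders.
                s.connect(SocketAddrV4::new(key.1, key.2))?;
                s.set_nonblocking(true)?;
                v.insert(s)
            }
        };
        sock.send(payload)
    }
}

fn main() -> std::io::Result<()> {
    // Host-side receiver standing in for the flow's destination.
    let receiver = UdpSocket::bind("127.0.0.1:0")?;
    let dst = match receiver.local_addr()? {
        std::net::SocketAddr::V4(a) => a,
        _ => unreachable!(),
    };
    let mut flows = UdpFlows { flows: HashMap::new() };
    flows.send((40000, *dst.ip(), dst.port()), b"hello")?;
    let mut buf = [0u8; 16];
    let (n, _) = receiver.recv_from(&mut buf)?;
    println!("received {} bytes", n);
    Ok(())
}
```

A connected socket is what makes the non-blocking reply polling in relay_udp_flows cheap: each flow's replies come back on its own FD, so no demultiplexing table is needed on the receive side.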
Each active UDP flow socket is polled non-blocking on every `drain_to_guest` tick; replies are wrapped in an Ethernet/IPv4/UDP frame (src=original-dst, dst=guest) and pushed into `inject_to_guest`. Wire the call into `drain_to_guest` after `relay_icmp_echo`. Also add `UdpRepr` to the smoltcp wire imports and drop the now-consumed `#[allow(dead_code)]` on `UdpFlowEntry::last_activity`. * feat(slirp): UDP flow idle reap (60s) * test(network): full RTT for UDP pin (was BROKEN_ON_PURPOSE one-way) Rename udp_non_dns_silently_dropped → udp_non_dns_round_trips and rewrite the body to verify the complete guest→host→guest round-trip via the per-flow connected-socket NAT landed in Tasks 2.1–2.4. * bench(network): document DNS qps busybox-nc bottleneck (set null + WARN) busybox nc -u -w1 blocks for the full 1-second timeout after stdin EOF even when the cached SLIRP reply arrives in microseconds, capping throughput at ~1 qps. Tighter flags tried: -q0 exits before the reply arrives (0 successes); /dev/udp/ is bash-only; timeout(1) is absent from the test initramfs. Report udp_dns_qps as null with a WARN pointing to the host-side UDP socket path as the correct future fix. Also removes the now-dead DNS_QPS_WINDOW_SECS and SLIRP_DNS_ADDR constants. * fix(startup-bench): require userspace vsock backend for snapshot capture The bench's `capture_snapshot` was building a Sandbox without `.enable_snapshots(true)`, so the backend selector at `backend/kvm.rs:212` chose `VsockBackendType::Vhost` (lower per-RPC latency for cold-only runs). The `create_auto_snapshot` call then captured a vhost-shaped snapshot. But `from_snapshot` always restores into `VsockBackendType::Userspace` — a path that knows how to re-program our process-local vring state, while vhost's vring state lives in the host kernel's `vhost-vsock` module and isn't part of the snapshot at all. 
Result: the restored userspace device has half-blank state, never accepts connections from the host, every connect attempt is RST'd by the guest kernel, and the multiplex handshake hits its 30s deadline. Symptom across CI and local Fedora bare-metal: control_channel[multiplex-establish]: deadline reached after 123 connect/handshake attempts Error: Guest("control_channel: deadline reached") This same failure was visible in CI run 24983657846 on main (April 27, before any of the SLIRP refactor work) — masked by `continue-on-error: true` on the wall-clock harness step. This change lands both the fix and the removal of the CI mask, so a regression of this exact shape now fails the workflow. Verified locally: `voidbox-startup-bench --iters 3 --breakdown` now exits 0 with `warm.total p50 = 82ms` (well within the CHANGELOG's 138ms target). Cold phase numbers unchanged (~245ms p50). Refs: - backend/kvm.rs:205-216 (the backend selector) - CHANGELOG.md:74 ("Snapshot/Restore for KVM ... userspace virtio-vsock backend") - AGENTS.md:1185 ("snapshot_integration ... Uses userspace virtio-vsock backend") * docs(plans): add Phase 3 plan (TCP relay rewrite via MSG_PEEK + sequence mirroring) * docs(plans): lock observability as a hard non-negotiable invariant Per user pushback ("the improvements based on passt, maintain our differentiator of full observability on the SLIRP implementation, that is a must?") — yes, and it should be stated explicitly, not assumed. Spec gets a "Hard invariant — observability" section right after the motivation. Phase 3 plan gets a "Non-negotiable invariants" block that codifies what every task in the high-risk TCP-relay rewrite must preserve: - All-Rust, no opaque-process boundary; libc syscalls are fine. - tracing instrumentation at every state transition (peek, ACK consume, close); new code must add new events for new state. - cargo-test-driveable behavior via tests/network_baseline.rs. - Standard Rust tooling (LSP, clippy, profiler) keeps working. 
Future phases inherit the spec-level invariant; their per-phase plans will reiterate the task-level acceptance criteria. * refactor(slirp): add bytes_in_flight to TcpNatEntry (no behavior change) Scaffolding for Task 3.3/3.4: tracks bytes sent to guest but not yet ACK'd. Initialized to 0 at all construction sites; dead_code suppressed until Task 3.3 consumes it. * refactor(slirp): add recv_peek helper using libc::recv MSG_PEEK * refactor(slirp): peek-based host→guest TCP relay (drops to_guest dependency) Replace the read-into-to_guest + drain loop in relay_tcp_nat_data with a MSG_PEEK-based send path. recv_peek() peeks the kernel's recv buffer without consuming it; only the bytes past bytes_in_flight are chunked into TCP segments and injected toward the guest. our_seq and bytes_in_flight advance as segments are sent; the kernel buffer holds the data until Task 3.4's ACK-driven read() consumes it. Remove #[allow(dead_code)] from recv_peek and bytes_in_flight (both now consumed). Add #[allow(dead_code)] to to_guest (still on struct; Task 3.6 removes it). Drop unused Read import. Tracing: trace! per peek+send cycle, debug! on host EOF, warn! on recv_peek errors. * refactor(slirp): ACK-driven consume from kernel socket When the guest ACKs data we relayed via the peek-based host→guest path (Task 3.3), read() those bytes from the host_stream to advance the kernel's recv buffer read pointer. Without this the kernel buffer fills up and recv_peek keeps returning the same already-sent bytes. Logic in handle_tcp_frame, Established branch: - Extract segment_ack from tcp.ack_number().0 as u32. - Compute last_sent_acked = our_seq.wrapping_sub(bytes_in_flight). - acked_bytes = segment_ack.wrapping_sub(last_sent_acked) — wrapping arithmetic throughout because TCP sequence numbers wrap at 2^32. - Guard: only consume when acked_bytes > 0 && <= bytes_in_flight, defending against duplicate/spurious/malformed guest ACKs. 
- Drain via read() loop into a stack sink; decrement bytes_in_flight by the actual drained count, not the claimed acked_bytes. - tracing::trace! on each consume; tracing::warn! + Closed on read error. All 14 network_baseline tests pass including tcp_data_round_trip. * refactor(slirp): drop to_host buffer + 256KB cliff, use TCP backpressure Replace the guest→host write path with don't-ACK-on-EAGAIN: on WouldBlock, skip the ACK and let the guest TCP retransmit once the kernel send buffer drains. Remove the MAX_TO_HOST_BUFFER constant (256 KB cap), the overflow- close branch, and the relay_tcp_nat_data to_host flush block. Mark to_host and to_host_pending_ack dead_code pending Task 3.6 cleanup. * refactor(slirp): drop to_guest/to_host/pending_ack fields and dead helpers Remove three dead fields from TcpNatEntry that were superseded by the passt-style peek+ACK+backpressure paths added in Tasks 3.3–3.5: - to_guest: Vec<u8> (replaced by recv(MSG_PEEK)-based send in 3.3) - to_host: Vec<u8> (replaced by direct write + don't-ACK-on-WouldBlock in 3.5) - to_host_pending_ack: Option<u32> (replaced by ACK on n_written in 3.5) Remove the three matching initializer sites from the TcpNatEntry constructor. Update the file-level Architecture doc and the bytes_in_flight field comment to reflect the Phase 3 design (no userspace buffers; kernel socket buffer holds outstanding data). * test(network): flip 256KB cliff pin — assert >1MB throughput succeeds Renames tcp_to_host_buffer_drops_at_256kb → tcp_writes_more_than_256kb_succeed and rewrites the body to assert the Phase 3 positive contract: pushing 1 MB through the relay succeeds with no RST/FIN mid-stream. Updates the file-level BROKEN_ON_PURPOSE inventory accordingly. 
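The wrapping-arithmetic guard from the ACK-driven consume (Task 3.4) is pure compute, so it can be shown in isolation. `our_seq`, `bytes_in_flight`, and the guard condition follow the commit text verbatim; the function shape is illustrative:

```rust
/// Sketch of the ACK-consume guard in the Established branch: given the
/// relay's current `our_seq` and `bytes_in_flight`, decide how many bytes a
/// guest ACK lets us read() out of the kernel recv buffer. Wrapping
/// arithmetic throughout — TCP sequence numbers wrap at 2^32.
fn ackable_bytes(our_seq: u32, bytes_in_flight: u32, segment_ack: u32) -> u32 {
    let last_sent_acked = our_seq.wrapping_sub(bytes_in_flight);
    let acked = segment_ack.wrapping_sub(last_sent_acked);
    // Guard against duplicate, spurious, or malformed guest ACKs: only
    // consume when the claim is positive and within what we actually sent.
    if acked > 0 && acked <= bytes_in_flight {
        acked
    } else {
        0
    }
}

fn main() {
    // Normal case: guest ACKs 1000 of 1500 outstanding bytes.
    assert_eq!(ackable_bytes(10_500, 1_500, 10_000), 1_000);
    // Duplicate ACK (acks nothing new) → 0, nothing consumed.
    assert_eq!(ackable_bytes(10_500, 1_500, 9_000), 0);
    // Sequence-space wraparound: our_seq already wrapped past 0.
    assert_eq!(ackable_bytes(500, 1_500, 0u32.wrapping_sub(500)), 500);
    println!("ok");
}
```

Per the commit, the relay then decrements `bytes_in_flight` by the count actually drained from the socket, not by this claimed value.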
* bench(network): tcp_bulk_throughput_1mb — measures post-Phase-3 backpressure path Adds a divan microbench that pushes 1 MiB through the SLIRP relay under a constrained host receiver (SO_RCVBUF=4096), forcing the passt-style sequence-mirroring + don't-ACK-on-EAGAIN backpressure path on every iteration. Reports throughput in MB/s via BytesCount so regressions are numerically visible. Mirrors the 95%-delivery threshold from the tcp_writes_more_than_256kb_succeed contract test. ~61 ms median / ~17 MB/s on this host; 10 samples complete in well under 60 s total. * bench(network): --bulk-mb mode to measure post-Phase-3 backpressure Adds a guest→host throughput measurement that pins the host listener's SO_RCVBUF to 4096 before accept(). The constrained receiver forces TCP-level backpressure to engage during the transfer: the SLIRP relay's non-blocking write to host_stream returns WouldBlock, the relay declines to ACK the segment, and the guest retransmits — exercising the don't-ACK-on-EAGAIN path that Phase 3 introduced. Pre-Phase-3 the same scenario hit the 256 KB userspace cliff and reset the connection mid-transfer; post-Phase-3 the bytes go through. Smoke run on this host (Fedora 43 / KVM / slim x86_64): bulk-g2h[ 0]: 10485760 B in 0.429s = 1565.6 Mbps (constrained receiver) Compare to the unconstrained tcp_throughput_g2h_mbps (~1885 Mbps) — the ~17% reduction is the backpressure cost. The metric is opt-in (--bulk-mb 0 by default) so it doesn't slow down standard runs. Companion to the divan microbench tcp_bulk_throughput_1mb (commit 5fe4316) that exercises the same path at the unit level. The wall-clock metric is what we'd compare against passt+qemu in the future side-by-side run. * docs(plans): add Phase 4 plan (unified flow table refactor) * refactor(slirp): define FlowKey + FlowEntry enums (no callers yet) Add Copy to NatKey (all fields are trivially copyable: u16, Ipv4Address, u16) and clean up three clone_on_copy sites that clippy now catches. 
Introduce FlowKey and FlowEntry alongside the existing per-protocol types; both are marked #[allow(dead_code)] until Task 4.2 wires the unified flow_table field. * fix(ci): non-Linux stubs for benches/network.rs + voidbox-network-bench Both files used file-level `#![cfg(target_os = "linux")]`, which on macOS produces an empty crate with no `main()` → E0601. Caught by PR #68's macOS CI lanes (Lint, MSRV, Test, E2E VZ). Fix mirrors `benches/startup.rs`: keep the `main()` shape unconditional and gate only the SLIRP-using imports + body. The smoltcp dep is already `cfg(target_os = "linux")` in Cargo.toml, so the Linux-only items genuinely can't compile on macOS — wrapping them in a Linux-only module is the cleanest way to keep the cfg gating in one place. - `benches/network.rs`: `mod linux_benches { ... }` wraps every helper and `#[divan::bench]`. Top-level `fn main()` calls `divan::main()` on Linux and prints a skip notice elsewhere. - `src/bin/voidbox-network-bench/main.rs`: `mod linux_main { ... }` wraps everything from `TRANSFER_MB` to the bottom of the file. Top-level provides two cfg-gated `fn main()` shapes — Linux delegates to `linux_main::main_impl()`, non-Linux prints a skip notice. Linux validation: - cargo fmt --check: clean - cargo clippy -D warnings: clean - cargo test --test network_baseline: 14/14 * docs(plans): add three Phase 4 benches (mixed flows, per-protocol, table ops) * refactor(slirp): add flow_table field on SlirpBackend (parallel to existing maps) * refactor(slirp): migrate ICMP path to flow_table Replace all self.icmp_echo accesses (5 sites: entry insert/lookup in handle_icmp_frame, keys iteration + get_mut + remove in relay_icmp_echo) with self.flow_table keyed on FlowKey::IcmpEcho. Drop the icmp_echo field and its HashMap::new() initializer. Drop #[allow(dead_code)] from FlowKey, FlowEntry, and flow_table; add variant-level #[allow(dead_code)] on Tcp/Udp variants that are consumed in tasks 4.4 and 4.5. 
All 14 network_baseline pins pass; fmt + clippy clean. * refactor(slirp): migrate UDP path to flow_table Replace all self.udp_flows accesses with self.flow_table keyed on FlowKey::Udp / FlowEntry::Udp, following the same pattern as the ICMP migration in 4.3. Drop the udp_flows field and its HashMap::new() initializer. Remove #[allow(dead_code)] from FlowKey::Udp and FlowEntry::Udp now that both variants are consumed. * refactor(slirp): migrate TCP path to flow_table Move all 9 self.tcp_nat accesses to self.flow_table under FlowKey::Tcp / FlowEntry::Tcp. Drop the tcp_nat field and its HashMap::new() initializer. Remove #[allow(dead_code)] from FlowKey::Tcp and FlowEntry::Tcp now that both variants are actively consumed. Max-concurrent check now counts FlowKey::Tcp entries in flow_table. relay_tcp_nat_data collects TCP flow keys then iterates with get_mut, matching the established ICMP/UDP patterns from 4.3–4.4. All 14 network_baseline tests pass; tcp_bulk_throughput_1mb bench: 17.06 MB/s. * refactor(slirp): update Phase 4 doc header for unified flow table * bench(network): poll_with_n_mixed_flows — mixed TCP/UDP/ICMP at scale * bench(network): process_udp_frame + process_icmp_echo_request * bench(network): add flow_table_insert_remove synthetic microbench Pure-compute baseline for Phase 4's unified HashMap<FlowKey, FlowEntry>. Measures insert + remove throughput on n=[10, 100, 1000] entries using synthetic u32 values, isolating HashMap mechanics from socket overhead. Phase 5+ reference number for hasher experiments (foldhash, ahash, SipHash) or container-shape changes (hashbrown raw API). Uses proxy data (usize -> u32 map) instead of real TcpNatEntry to avoid socket cloning cost per insert — the bench goal is HashMap cost, not socket ops. 
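The unified flow table from Phase 4 has a simple shape worth sketching. The key variants follow the tuples named in the migration commits; the entry payloads and `std::net::Ipv4Addr` (standing in for smoltcp's `Ipv4Address`) are illustrative assumptions:

```rust
use std::collections::HashMap;
use std::net::Ipv4Addr;

// Illustrative shape of the unified flow table. Key variants mirror the
// per-protocol tuples from the 4.3-4.5 migrations; payloads are placeholders.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum FlowKey {
    Tcp { guest_src_port: u16, dst_ip: Ipv4Addr, dst_port: u16 },
    Udp { guest_src_port: u16, dst_ip: Ipv4Addr, dst_port: u16 },
    IcmpEcho { guest_id: u16, dst_ip: Ipv4Addr },
}

#[allow(dead_code)] // placeholder payloads; not all variants constructed below
enum FlowEntry {
    Tcp { bytes_in_flight: u32 /* host_stream, seq state, … */ },
    Udp { last_activity_ticks: u64 /* connected socket, … */ },
    IcmpEcho { guest_id: u16 /* SOCK_DGRAM socket, … */ },
}

fn main() {
    let mut flow_table: HashMap<FlowKey, FlowEntry> = HashMap::new();
    let dst = Ipv4Addr::new(93, 184, 216, 34);
    flow_table.insert(
        FlowKey::Tcp { guest_src_port: 40000, dst_ip: dst, dst_port: 80 },
        FlowEntry::Tcp { bytes_in_flight: 0 },
    );
    flow_table.insert(
        FlowKey::IcmpEcho { guest_id: 7, dst_ip: dst },
        FlowEntry::IcmpEcho { guest_id: 7 },
    );
    // Per-protocol sweeps filter on the key variant — e.g. the TCP
    // max-concurrent check described in the 4.5 commit:
    let tcp_count = flow_table
        .keys()
        .filter(|k| matches!(k, FlowKey::Tcp { .. }))
        .count();
    println!("tcp flows: {}", tcp_count);
}
```

One map for all three protocols is what makes the mixed-flow benches meaningful: insert/lookup/reap cost is measured against a single container regardless of protocol mix.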
* docs(plans): add Phase 5 plan (stateless NAT + port forwarding) * feat(network): add nat.rs with stateless translate_outbound (no callers yet) Pure types (Rules, PortForward, ForwardProto) and translate_outbound function that maps guest destination addresses to host SocketAddrs. No per-flow state; deny-list check beats gateway-loopback rewrite. Doc-test + unit tests included. * refactor(slirp): add nat::Rules field on SlirpBackend (parallel to deny_list) * refactor(slirp): TCP path uses nat::translate_outbound Replace the two inline operations in the SYN branch of handle_tcp_frame with a single nat::translate_outbound call: - the SLIRP_GATEWAY_IP → 127.0.0.1 rewrite - the deny-list iteration (previously via is_denied) The RST-emission shape and warn! event are preserved verbatim. Drop the now-callerless is_denied method; add #[allow(dead_code)] to deny_list (still held for task 5.4 which migrates UDP and then drops the field). * refactor(slirp): UDP path uses nat::translate_outbound, drop deny_list field * refactor(slirp): plumb port_forwards from NetworkConfig into nat::Rules Add `port_forwards: &[(u16, u16)]` to `SlirpBackend::with_security`. Each tuple is mapped to `nat::PortForward { proto: ForwardProto::Tcp, .. }` and stored in `nat::Rules.port_forwards`. `SlirpBackend::new()` passes `&[]` as before. The cold-boot VMM construction site (`src/vmm/mod.rs`) also passes `&[]` with a TODO(5.5b) comment — `VoidBoxConfig` does not yet carry `port_forwards`, so wiring the real slice is deferred to sub-task B. The snapshot-restore site calls `SlirpBackend::new()` and is unaffected. No relay code reads `nat.port_forwards`; no host listeners are spawned. Sub-task B (5.5b) will add the actual TcpListener-per-rule logic. All 14 network_baseline tests pass. fmt + clippy clean. 
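The stateless translate_outbound contract — deny-list check beats the gateway-loopback rewrite — can be sketched in a few lines. The ordering and the 10.0.2.2 → 127.0.0.1 rewrite come from the commits above; the `Rules` shape here is a reduced stand-in for the real nat::Rules:

```rust
use std::net::{Ipv4Addr, SocketAddr, SocketAddrV4};

const SLIRP_GATEWAY_IP: Ipv4Addr = Ipv4Addr::new(10, 0, 2, 2);

// Reduced stand-in for nat::Rules (the real type also carries port_forwards).
struct Rules {
    deny_list: Vec<Ipv4Addr>,
}

/// Stateless mapping from a guest destination to a host SocketAddr.
fn translate_outbound(rules: &Rules, dst_ip: Ipv4Addr, dst_port: u16) -> Option<SocketAddr> {
    // Deny check first — a denied gateway address must not be rewritten to
    // loopback and then allowed through.
    if rules.deny_list.contains(&dst_ip) {
        return None; // caller emits the RST / drop, per the TCP path commit
    }
    let host_ip = if dst_ip == SLIRP_GATEWAY_IP {
        Ipv4Addr::LOCALHOST // 10.0.2.2 → 127.0.0.1, matching the TCP NAT path
    } else {
        dst_ip
    };
    Some(SocketAddr::V4(SocketAddrV4::new(host_ip, dst_port)))
}

fn main() {
    let rules = Rules { deny_list: vec![SLIRP_GATEWAY_IP] };
    assert_eq!(translate_outbound(&rules, SLIRP_GATEWAY_IP, 80), None);
    let open = Rules { deny_list: vec![] };
    let expected: SocketAddr = "127.0.0.1:443".parse().unwrap();
    assert_eq!(translate_outbound(&open, SLIRP_GATEWAY_IP, 443), Some(expected));
    println!("ok");
}
```

Because the function holds no per-flow state, both the TCP SYN branch and the UDP first-datagram branch can call it without coordination — which is what let the deny_list field migrate out of SlirpBackend.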
* test(network): pin nat::translate_outbound (loopback, external, deny) * bench(network): nat_translate_outbound_hot_path — Phase 5 baseline * feat(slirp): TcpNatState::SynSent + handle inbound SYN-ACK Add the SynSent state to TcpNatState for host-initiated (port-forward) connections. When handle_tcp_frame sees SYN+ACK on a SynSent entry it sends an ACK to the guest, advances our_seq, records guest_ack, and transitions to Established — completing the inbound 3-way handshake. Add #[cfg(test)] helpers on SlirpBackend (insert_synthetic_synsent_entry, tcp_flow_state, injected_plain_ack_count) and a unit test tcp_inbound_syn_ack_completes_handshake that seeds a SynSent entry, feeds a guest SYN-ACK, and asserts (a) state → Established and (b) one plain ACK queued for injection. The full E2E contract is deferred to task 5.5b.5 (tcp_port_forward_inbound in tests/network_baseline.rs). build_tcp_packet_static signature: (src_ip, dst_ip, src_port, dst_port, seq, ack, control, payload). The inbound ACK uses src=SLIRP_GATEWAY_IP, dst=SLIRP_GUEST_IP, src_port=key.dst_port (high port), dst_port= key.guest_src_port. * bench(network): tcp_inbound_syn_ack_transition — Phase 5.5b.1 microbench Adds a divan bench for the SynSent → Established state-machine path introduced in 5.5b.1. The bench seeds one synthetic SynSent entry, feeds a SYN-ACK frame to process_guest_frame, and measures the transition cost (~42 µs median, same order as process_syn). Approach (option a): widen the three #[cfg(test)] helpers on SlirpBackend to #[cfg(any(test, feature = "bench-helpers"))]. insert_synthetic_synsent_entry is promoted to `pub` within the gated impl block so the bench binary (a separate compilation unit) can call it. The feature is never enabled in production builds. All helpers in benches/network.rs that are only needed under bench-helpers are gated with #[cfg(feature = "bench-helpers")] to keep the default bench binary warning-free. 
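The SynSent → Established arm is a small state machine. The transition effects (send a plain ACK, advance our_seq, record guest_ack) follow the commit text; the exact sequence bookkeeping and the `Flow`/`on_guest_segment` names are illustrative assumptions, not the real handle_tcp_frame signature:

```rust
// Minimal sketch of the inbound-handshake arm for host-initiated
// (port-forward) connections. Frame parsing and injection are elided.
#[derive(Debug, PartialEq)]
enum TcpNatState {
    SynSent,
    Established,
}

struct Flow {
    state: TcpNatState,
    our_seq: u32,
    guest_ack: u32,
}

/// On SYN+ACK for a SynSent entry, returns the (seq, ack) pair for the plain
/// ACK to inject toward the guest; None for any other segment.
fn on_guest_segment(flow: &mut Flow, syn: bool, ack: bool, seq: u32) -> Option<(u32, u32)> {
    if flow.state == TcpNatState::SynSent && syn && ack {
        flow.our_seq = flow.our_seq.wrapping_add(1); // our SYN consumed one seq
        flow.guest_ack = seq.wrapping_add(1);        // acknowledge the guest's SYN
        flow.state = TcpNatState::Established;
        return Some((flow.our_seq, flow.guest_ack));
    }
    None
}

fn main() {
    let mut flow = Flow { state: TcpNatState::SynSent, our_seq: 1000, guest_ack: 0 };
    let ack = on_guest_segment(&mut flow, true, true, 5000);
    assert_eq!(ack, Some((1001, 5001)));
    assert_eq!(flow.state, TcpNatState::Established);
    println!("ok");
}
```

The unit test described in the commit (seed SynSent, feed SYN-ACK, assert Established + one queued ACK) exercises exactly this transition, just at the frame level.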
* feat(slirp): add synthesize_inbound_syn helper for port-forwarding * bench(network): synthesize_inbound_syn pure-compute (Phase 5.5b.2.b) * feat(slirp): port-forward listener thread implementation (not wired yet) Adds the host listener thread infrastructure for Phase 5.5b inbound TCP port-forwarding. No listener is spawned yet (that is task 5.5b.4), so there is no behavior change in this commit. New items in src/network/slirp.rs: - `InboundAccept` struct (pub(crate)) — channel payload from listener to net-poll - `port_forward_listeners: Vec<JoinHandle<()>>` field — holds spawn handles - `port_forward_shutdown: Arc<AtomicBool>` field — graceful-shutdown signal - `pending_inbound_accepts: mpsc::Receiver<InboundAccept>` field — accept channel rx - `accept_sender: mpsc::Sender<InboundAccept>` field — keeps channel open + test helper - `process_pending_inbound_accepts()` — drains channel, inserts SynSent entries, queues SYNs - `run_port_forward_listener()` — module-scope thread fn, nonblocking accept loop - `spawn_port_forward_listeners()` — pub(crate) factory, not called until 5.5b.4 - `Drop for SlirpBackend` — sets shutdown flag, joins all listener handles - `push_inbound_accept()` on test-only impl — injects accepts for unit tests - `drain_to_guest` now calls `process_pending_inbound_accepts()` as step 0 TDD: test `process_pending_inbound_accepts_seeds_synsent_and_queues_syn` written and watched fail before implementation; passes in GREEN. 17/17 network_baseline integration tests unchanged. * feat(slirp): wire spawn_port_forward_listeners from with_security Call `spawn_port_forward_listeners` in `SlirpBackend::with_security` so host listener threads are actually spawned when `nat.port_forwards` is non-empty. The function now also returns the `Sender` end of the accept channel so `accept_sender` (needed to keep the channel open in tests) is sourced from the same channel pair as `pending_inbound_accepts`. Remove the `#[allow(dead_code)]` attrs from both functions. 
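The listener-thread → accept-channel shape just described can be sketched with std primitives: a non-blocking accept loop hands accepted host connections to the net-poll side over an mpsc channel and exits when the shutdown flag flips. `InboundAccept` here carries less than the real struct, and the 50 ms poll interval is the one named in the accept-latency bench:

```rust
use std::net::{TcpListener, TcpStream};
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{mpsc, Arc};
use std::thread;
use std::time::Duration;

// Reduced stand-in for the real channel payload.
struct InboundAccept {
    #[allow(dead_code)]
    stream: TcpStream, // handed to the net-poll thread for the NAT entry
    guest_port: u16,
}

fn run_port_forward_listener(
    listener: TcpListener,
    guest_port: u16,
    tx: mpsc::Sender<InboundAccept>,
    shutdown: Arc<AtomicBool>,
) {
    listener.set_nonblocking(true).expect("nonblocking listener");
    while !shutdown.load(Ordering::Relaxed) {
        match listener.accept() {
            Ok((stream, _peer)) => {
                // process_pending_inbound_accepts will seed a SynSent entry
                // and queue a synthesized SYN toward the guest.
                if tx.send(InboundAccept { stream, guest_port }).is_err() {
                    break; // receiver gone — backend dropped
                }
            }
            Err(e) if e.kind() == std::io::ErrorKind::WouldBlock => {
                thread::sleep(Duration::from_millis(50)); // poll interval
            }
            Err(_) => break,
        }
    }
}

fn main() {
    let listener = TcpListener::bind("127.0.0.1:0").unwrap();
    let addr = listener.local_addr().unwrap();
    let (tx, rx) = mpsc::channel();
    let shutdown = Arc::new(AtomicBool::new(false));
    let handle = {
        let shutdown = Arc::clone(&shutdown);
        thread::spawn(move || run_port_forward_listener(listener, 8080, tx, shutdown))
    };
    let _client = TcpStream::connect(addr).unwrap();
    let accept = rx.recv_timeout(Duration::from_secs(5)).unwrap();
    assert_eq!(accept.guest_port, 8080);
    shutdown.store(true, Ordering::Relaxed); // Drop-for-SlirpBackend shape
    handle.join().unwrap();
    println!("ok");
}
```

Keeping a Sender clone on the backend (the `accept_sender` field) is what holds the channel open even with zero live listener threads, so unit tests can inject accepts directly.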
Unit test `with_security_spawns_listener_per_tcp_port_forward` confirms zero threads for empty rules and one thread per TCP rule. * test(network): tcp_port_forward_inbound — Phase 5.5b e2e contract Adds tcp_port_forward_inbound_connect_succeeds to tests/network_baseline.rs. The test builds a SlirpBackend with one port-forward rule (18080→8080), drives drain_to_guest in a loop while a host thread connects to 127.0.0.1:18080, synthesizes a guest listener by responding with SYN-ACK to the SYN the stack emits, and asserts three contract points: 1. host TcpStream::connect succeeds — listener thread (5.5b.3) is alive. 2. drain_to_guest emits a synthesized SYN to GUEST_PORT — InboundAccept channel + process_pending_inbound_accepts + synthesize_inbound_syn (5.5b.2/5.5b.3/5.5b.4) all fired. 3. drain_to_guest emits the completing ACK after our SYN-ACK — the SynSent → Established arm (5.5b.1) fired. Also adds parse_tcp_to_guest_full helper (superset of parse_tcp_to_guest that also returns src/dst ports, needed to identify the ephemeral high_port in the synthesized SYN). No VM, no --ignored flag, completes in ~0.1 s. * bench(network): port_forward_accept_latency — Phase 5.5b wall-clock baseline Times the full inbound port-forward path: host TcpStream::connect → listener thread accept() → mpsc channel → process_pending_inbound_accepts → synthesize_inbound_syn → drain_to_guest output. Bounded above by PORT_FORWARD_POLL_INTERVAL (50ms). Regressions in the inbound state machine or listener poll loop now surface numerically against this baseline. * chore(bench): add scripts/bench-compare.sh — phase comparison report Compares HEAD against an arbitrary baseline ref using the two bench harnesses: divan microbenches (cargo bench --bench network) and the VM-backed wall-clock harness (voidbox-network-bench). Emits markdown with absolute numbers + percent deltas, suitable for PR descriptions. 
Replaces the scattered /tmp/baseline-network-phase*.json files with a reproducible single entry-point. * test(network): icmp_echo_returns_reply — probe + assert, no silent skip Reviewer (Copilot) flagged that the previous skip-on-no-reply path masked real ICMP regressions on hosts where unprivileged ICMP works. Probe socket(AF_INET, SOCK_DGRAM, IPPROTO_ICMP) once: skip only on EPERM/EACCES (sysctl net.ipv4.ping_group_range forbids it), assert otherwise. * fix(network-bench): skip failed iterations + drop guest-ping ICMP path C1.3: measure_tcp_throughput_g2h and measure_bulk_throughput_g2h now `continue` on guest nc non-zero exit so failed iterations don't skew the reported mean. C2.2: measure_icmp_rr_latency dropped its guest-side ping path. The guest images intentionally omit /bin/ping (busybox-static lacks CONFIG_FEATURE_PING_TYPE_DGRAM and SOCK_RAW would need root); the function now returns None with a warn explaining the gap. Proper host-driven measurement is tracked as a follow-up. * bench(network): migrate from deprecated .poll() to drain_to_guest() Aligns benches with the production RX path. The deprecated poll() allocated a fresh Vec<Vec<u8>> per call; drain_to_guest appends to a caller-owned buffer that's reused across iterations. CI now gates on the same allocator pattern production code uses, removing avoidable allocation overhead from the measurements. Drops the file-level #![allow(deprecated)]. * chore(bench): bench-compare.sh — fall back without bench-helpers feature When the baseline ref pre-dates a464fc1 (Phase 5.5b.1) it doesn't have the `bench-helpers` cargo feature, and cargo errored out, dropping the entire baseline divan section to —. Detect the "does not have feature" / "unknown feature" stderr signal and retry without --features bench-helpers. Benches that exist at both refs get real Δ%; the bench-helpers-gated ones (synthesize_inbound_syn, tcp_inbound_syn_ack_transition) naturally remain — for baseline. 
Unlocks bench-compare.sh --baseline origin/main against the full branch history. * fix(network-bench): bound accept-thread lifetimes with deadlines drain_one_connection, rr_echo_server, crr_echo_server now accept with a deadline derived from LATENCY_RECV_TIMEOUT + slack. Previously each spawned thread blocked forever on listener.accept() if the guest nc never connected (exec error, network failure), holding the listener FD across all subsequent iterations and burning thread/FD slots. When the accept deadline lapses, the thread exits cleanly, the listener drops, and the next iteration starts with a clean slate. Addresses Copilot review C2.1, C2.5, C2.6. * docs: Phase 6 overview plan — TCP lifecycle + async connect + windows + epoll Scopes the four architectural follow-ups surfaced in the smoltcp-passt-port PR review: - 6.1 (high) : TCP half-close (FinWait*/CloseWait/LastAck) — silent data loss on shutdown(SHUT_WR) today. - 6.2 (med-h) : Async outbound connect — vCPU thread blocked up to 3 s on slow destinations. - 6.3 (med) : Window management + scaling — guest window ignored; advertised window hardcoded 65535. - 6.4 (med-l) : Event-driven RX polling — replace 5 ms timer with epoll_wait. Locks the observability + cross-platform + snapshot invariants from the top-level spec. Per-subsystem TDD task lists deferred to dedicated plans (-phase6.1.md..-phase6.4.md) written before each kicks off. * docs: Phase 6.4 detailed TDD plan — epoll-driven RX 14 bite-sized tasks covering the event-driven RX rewrite: - Task 1: drain_n migration + tcp_writes_more_than_256kb_succeed retransmit-test fix-up (Copilot C1.1; correctly classified as VALID after re-reading; the previous "stale" verdict missed that seq advances unconditionally on line 398). - Task 2: BROKEN_ON_PURPOSE pin tcp_rx_latency_sub_5ms. - Tasks 3-6: EpollDispatch module — epoll_create1, register/ unregister, wait_with_timeout, self-pipe Waker. - Task 7: SlirpBackend holds EpollDispatch + Waker. 
- Tasks 8-9: TCP/UDP/ICMP register-on-insert + unregister-on-remove. - Task 10: relay_*_data dispatch by readiness instead of full-table iteration. - Task 11: net_poll_thread rewritten to epoll_wait(50ms); BROKEN_ON_PURPOSE pin flips to passing. - Task 12: snapshot rebuild path (epoll_fd doesn't survive snapshot). - Task 13: tcp_rx_latency_one_packet bench + perf gate. - Task 14: full validation contract. Hard perf gate: scripts/bench-compare.sh --baseline origin/main must show every comparable bench at HEAD ≤ baseline + 5%, AND tcp_rx_latency_us_p50 < 5ms (was bounded below by 5ms timer cycle), AND port_forward_accept_latency improves by ≥30% (or documented exception). Lives on branch smoltcp-passt-port-phase6.4-epoll, cut from PR #68's tip (47868f0). Will become its own PR once the plan executes. * test(network): drain_n via drain_to_guest + real retransmit in 256kb test Two test-harness improvements landing together since both block the Phase 6.4 RX-latency work: - drain_n migrated from deprecated SlirpBackend::poll() to drain_to_guest. This was the last in-tree poll() caller. - tcp_writes_more_than_256kb_succeed now matches its 'we re-send those' comment: seq only advances when acked_seq catches up, giving real TCP-retransmit semantics in the synthetic guest rather than the previous 'lossy with 95% tolerance' shape. Phase 6.4 must not regress this contract; making the test faithful first means epoll regressions surface as failures instead of borderline 95% misses. * test(network): pin tcp_rx_latency_sub_5ms (BROKEN_ON_PURPOSE) Phase 6.4 contract: host→guest RX latency must be sub-5 ms when data is available. Pre-6.4 the floor is the 5 ms net_poll_thread sleep cycle; this assertion fails on master and on the current PR #68 tip. Phase 6.4's epoll dispatch will flip it to passing. The #[ignore] attribute is deliberately NOT used: this is a positive contract and CI must surface the failure on master so the gate is unmissable. 
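The deadline-bounded accept pattern from the earlier voidbox-network-bench fix (drain_one_connection, rr_echo_server, crr_echo_server) is worth sketching, since the same shape recurs in the harness work above. A std-only illustration — the helper name and the 5 ms poll interval are assumptions, not the bench's actual code:

```rust
use std::net::{TcpListener, TcpStream};
use std::time::{Duration, Instant};

// Sketch: instead of blocking forever in accept(), poll a non-blocking
// listener until a deadline lapses, then give up so the thread exits and the
// listener FD drops — no leaked thread/FD slots when the guest never connects.
fn accept_with_deadline(listener: &TcpListener, deadline: Duration) -> Option<TcpStream> {
    listener.set_nonblocking(true).ok()?;
    let start = Instant::now();
    while start.elapsed() < deadline {
        match listener.accept() {
            Ok((stream, _peer)) => {
                // Restore blocking mode for the subsequent read/write pairs.
                stream.set_nonblocking(false).ok()?;
                return Some(stream);
            }
            Err(e) if e.kind() == std::io::ErrorKind::WouldBlock => {
                std::thread::sleep(Duration::from_millis(5));
            }
            Err(_) => return None,
        }
    }
    None // deadline lapsed: the guest nc never connected
}

fn main() {
    let listener = TcpListener::bind("127.0.0.1:0").unwrap();
    // No client connects → returns None instead of hanging the thread.
    assert!(accept_with_deadline(&listener, Duration::from_millis(100)).is_none());
    println!("ok");
}
```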
* Revert "test(network): pin tcp_rx_latency_sub_5ms (BROKEN_ON_PURPOSE)" This reverts commit 3e47ffbd2ba9f7edee64047145bf0e6ff3a0a811. * docs(phase6.4): drop Task 2 unit-level pin — VMM-level contract instead Implementer caught a planning error: the 5 ms host→guest latency floor lives in src/vmm/mod.rs:1609 (net_poll_thread sleep), not in SlirpBackend. drain_to_guest runs synchronously from a unit test, so the floor never materializes there — the original Task 2 test underflowed on subtraction and would never have observed what we care about. The contract — host→guest must deliver in < 5 ms — is preserved as a VM-level requirement in Task 13's wall-clock tcp_rx_latency_us_p50 metric in voidbox-network-bench. The plan's hard-perf-gate already requires it, so no contract is lost. Reverts 3e47ffb (the broken unit pin) and amends the plan to mark Task 2 as DROPPED with rationale. * feat(network): EpollDispatch skeleton with epoll_create1 Phase 6.4 foundation. One epoll_fd owned via OwnedFd + EPOLL_CLOEXEC. No registration logic yet — Task 4 will add register/unregister and Task 6 will add the self-pipe + wait loop. * feat(network): EpollDispatch register/unregister * feat(network): EpollDispatch::wait_with_timeout * feat(network): EpollDispatch self-pipe wakeup Cloneable Waker writes one byte to a non-blocking pipe registered with EPOLLIN. wait_with_timeout filters self-pipe events out of the returned set and drains the pipe so subsequent waits don't spurious-wake. * refactor(slirp): SlirpBackend holds Arc<Mutex<EpollDispatch>> + Waker Start with Arc<Mutex<EpollDispatch>> directly instead of plain EpollDispatch so Task 11's poll-thread refactor (which needs the Arc to share the same instance across threads) lands without a second struct-field touch. 
SlirpBackend gains two new fields: - `epoll: Arc<Mutex<EpollDispatch>>` — the epoll instance - `epoll_waker: Waker` — cloneable write-end of the self-pipe Both fields are unused until Task 8 wires up the register/unregister calls; #[allow(dead_code)] suppresses clippy -D warnings in the interim. Targeted #[allow(dead_code)] attributes on unregister, wait_with_timeout, EpollEvent, and Waker::wake/write_end cover the same interim period. All 18 baseline pins pass; epoll_dispatch unit tests (5/5) pass. * feat(slirp): register TCP flows with EpollDispatch After each TCP flow-table insertion, register the host TcpStream FD with EpollDispatch (EPOLLIN, token = flow_token_for_tcp) and call epoll_waker.wake() to unblock any polling thread. Before each removal, unregister the FD to prevent dangling epoll registrations. Three insertion sites covered: - process_pending_inbound_accepts: port-forward SynSent entries - handle_tcp_frame: outbound SYN → SynReceived entries - insert_synthetic_synsent_entry (test helper): guarded with #[cfg(not(test))] so unit tests skip epoll side-effects Two removal sites covered: - relay_tcp_nat_data to_remove loop: Closed + timeout evictions - handle_tcp_frame stale-entry purge before re-SYN Token layout (flow_token_for_tcp): bits 63-56: 0x01 (PROTO_TAG_TCP) bits 47-32: guest_src_port bits 31-16: dst_port bits 15-0: low 16 bits of dst_ip PROTO_TAG_UDP, PROTO_TAG_ICMP, flow_token_for_udp, flow_token_for_icmp are defined here (at module scope, project convention) with #[allow(dead_code)] — Task 9 wires them immediately after. All 18 baseline pins pass; 5/5 epoll unit tests pass. * feat(slirp): register UDP + ICMP flows with EpollDispatch Mirror Task 8's TCP pattern for the two remaining protocol families. UDP (handle_udp_frame): - On first datagram for a new (guest_src_port, dst_ip, dst_port) 3-tuple, record new_host_fd before the Entry API moves the UdpSocket into the flow table, then register(fd, flow_token_for_udp, EPOLLIN) + wake(). 
  - relay_udp_flows idle-reaper: unregister(sock.as_raw_fd()) before remove.

  ICMP (handle_icmp_frame):
  - On first echo request for a new (guest_id, dst_ip) pair, record new_icmp_fd before the Entry API moves the socket, then register + wake() after the borrow ends.
  - relay_icmp_echo idle-timeout eviction: look up the socket fd and unregister before remove (the borrow is already dropped by then).

  Token layouts:
  - flow_token_for_udp: 0x02 | guest_src_port<<32 | dst_port<<16 | dst_ip_low16
  - flow_token_for_icmp: 0x03 | guest_id<<32

  All 18 baseline pins pass; 23/23 lib-network tests pass. Bench compilation still clean.

* feat(slirp): relay loops dispatch by epoll readiness

  drain_to_guest polls the EpollDispatch with a zero-duration timeout and passes the resulting readiness set to the three relay methods (relay_tcp_nat_data, relay_icmp_echo, relay_udp_flows). Each relay now filters by protocol tag (PROTO_TAG_{TCP,UDP,ICMP}) and only visits flows whose socket appears as EPOLLIN-ready in the event set, avoiding O(flow_count) reads on every tick.

  relay_tcp_nat_data uses a two-pass design: pass 1 sweeps all TCP entries for Closed state and idle timeout unconditionally (so a guest FIN that marks an entry Closed in handle_tcp_frame causes the host TcpStream to drop promptly, giving the server-side reader an EOF); pass 2 restricts the peek/relay I/O to ready entries only.

  epoll_arc() is added to the NetworkBackend trait (Linux cfg-gated, default None) and overridden on SlirpBackend. VirtioNetDevice.epoll_arc() delegates to the backend, enabling net_poll_thread (Task 11) to obtain the shared Arc without an additional lock or refactor.

  All 18 baseline pins pass.

* feat(vmm): net_poll_thread driven by epoll_wait

  Replace the fixed 5 ms sleep with a blocking epoll_wait(50 ms) on the EpollDispatch instance obtained from the network backend.
  The thread wakes immediately when any registered host socket becomes readable (the relay loop runs at event time, not after a fixed delay) and falls back to a 50 ms housekeeping tick when idle — preserving the UDP/ICMP stale-flow reap path that was previously driven by the 5 ms sleep. If the backend does not expose an epoll instance (non-SlirpBackend, e.g. unit-test mocks), the thread keeps the original 5 ms sleep fallback.

  All 18 baseline pins pass. Release build clean.

* feat(slirp): rebuild epoll set on snapshot restore

  `epoll_fd` is a Linux kernel handle that does not survive snapshot: after `MicroVm::from_snapshot` creates a fresh `SlirpBackend` via `SlirpBackend::new()`, the new `EpollDispatch` starts with zero registered FDs. The current snapshot path does not reconstruct `flow_table` — the backend always starts empty and new flows form naturally — so the rebuild is a no-op today. It is wired in advance so Phase 6.1's half-close work (which will persist restored flows across snapshot/restore) has a ready call site.

  Changes:
  - `EpollDispatch`: add a `registered_count` field maintained by `register`/`unregister`; expose `registered_fd_count()` under `cfg(any(test, feature = "bench-helpers"))`.
  - `SlirpBackend::rebuild_epoll_from_flow_table()`: iterates `flow_table` and re-registers each live host FD (`host_stream`, `sock` for UDP/ICMP) with the current dispatcher.
  - `SlirpBackend::registered_fd_count()`: test/bench shim that delegates to `EpollDispatch::registered_fd_count()`.
  - `SlirpBackend::reset_epoll_for_snapshot_test()`: replaces the epoll dispatcher with a fresh empty one, simulating the post-snapshot state (kernel handle gone) for unit-level smoke tests.
  - `epoll_set_rebuilt_from_flow_table_smoke` in `network_baseline`: insert flow → reset epoll → assert count 0 → rebuild → assert count 1.
* fix(test): gate epoll_set_rebuilt smoke test on bench-helpers feature

  The smoke test consumes #[cfg(any(test, feature = "bench-helpers"))]-gated helpers (insert_synthetic_synsent_entry, reset_epoll_for_snapshot_test, registered_fd_count). Integration tests in tests/ don't get cfg(test) on the void-box library crate — they only see #[cfg(feature = "bench-helpers")] items when the feature is enabled. Without this gate, a default `cargo test --test network_baseline` fails to compile with E0599 on the four helper methods.

  Now:
  - Default cargo test → 18 pins pass, smoke test invisible.
  - cargo test --features bench-helpers -- --test-threads=1 → 19 pins pass, smoke test included.

  The serial-run requirement side-steps a pre-existing parallel-run flake in tcp_port_forward_inbound_connect_succeeds (host port-bind contention; not a Phase 6.4 regression).

* bench(network): tcp_rx_latency_one_packet — Phase 6.4 baseline

  The divan microbench `tcp_rx_latency_one_packet` measures the SLIRP-layer per-packet dispatch cost when one TCP flow is Established and the host kernel has data ready: one zero-timeout epoll_wait + readiness scan + peek + Ethernet frame construction. Measured median on this host: ~9.8 µs per drain_to_guest call.

  Pre-6.4 the relay iterated every flow in flow_table unconditionally, regardless of readiness. Post-6.4 it dispatches only the flows with an epoll EPOLLIN event, reducing wasted work on idle flows to zero. This bench is the regression anchor for that change.

  The bench is gated on `--features bench-helpers` (like the existing `tcp_inbound_syn_ack_transition` and `synthesize_inbound_syn` benches). It performs a full 3-way handshake outside the timed loop so only the hot relay path is measured.

  Note: this bench cannot exercise the net_poll_thread 50 ms epoll cycle (that thread does not run inside divan). The wall-clock host→guest latency floor is the province of voidbox-network-bench's `tcp_rx_latency_us_p50` field.
  That field is added to the Report struct in this commit but returns None (deferred): wiring a guest-side listener requires either a guest daemon or an additional exec RPC — both out of scope for Phase 6.4. The divan microbench is the primary numerical deliverable for this phase.

* perf(slirp): eliminate epoll mutex contention via event queue

  net_poll_thread holds the EpollDispatch mutex for the full 50 ms of its blocking wait. drain_to_guest's own non-blocking wait_with_timeout(ZERO) call contended on the same mutex, serializing the vCPU thread behind the net-poll thread. voidbox-network-bench saw TCP g2h throughput drop from ~1885 Mbps to ~44 Mbps (a 40× regression).

  Fix: SlirpBackend gets a small Mutex<Vec<EpollEvent>> queue. net_poll_thread pushes events into it after each successful wait_with_timeout. drain_to_guest drains the queue (a brief, uncontended lock) without touching EpollDispatch. A try_lock fallback path serves unit tests (no net_poll_thread) without blocking on the mutex.

  The NetworkBackend trait gains a push_ready_events default no-op so SlirpBackend can override it; VirtioNetDevice exposes push_events_to_backend as the trampoline called by net_poll_thread.

  Off-CPU profile evidence: drain_to_guest was 9% off-CPU (29.7 s in a 60 s window) waiting on the epoll mutex; this should drop to near zero post-fix.

* perf(slirp): replace two-pass relay sweep with lazy close queue

  relay_tcp_nat_data's pass 1 unconditionally copied every TCP FlowKey into a Vec to scan for Closed entries on every drain call. Cache misses hit 47/1K under load; poll_with_n_flows/100 regressed +246% (130 ns → 450 ns), /1000 regressed +220%.

  Fix: when handle_tcp_frame's FIN/RST handlers and mid-function error paths set state=Closed, they push the key onto a pending_close Vec. relay_tcp_nat_data drains this Vec at the top of its single ready-events pass — no O(n) collect required. Idle-timeout detection retains a direct flow_table iteration, but without allocating a separate key Vec.
* perf(slirp): prefer try_lock on epoll over pending_events in drain_to_guest

  The initial Bug A fix used pending_events.lock() + try_lock(epoll) in drain_to_guest's fast path, adding ~150 ns of overhead per call vs Phase 6.4 (one extra Mutex acquire). This showed up as a +38% regression in the poll_idle bench (441 ns → 611 ns).

  Revised approach: try_lock epoll first (zero cost when uncontended — tests, benches, idle net-poll thread). On Err (net_poll_thread holds the mutex for 50 ms), drain pending_events instead. In production the try_lock fails ~once per 50 ms window; in tests it always succeeds.

  Net result: drain_to_guest overhead matches Phase 6.4 when epoll is uncontended; contention is eliminated when net_poll_thread is actively waiting.

* perf(slirp): wake net-poll thread when process_guest_frame queues frames

  Pre-Phase-6.4, net_poll_thread woke unconditionally every 5 ms, so every ACK queued in inject_to_guest by handle_tcp_frame got flushed within 5 ms. Phase 6.4's epoll_wait(50 ms) waits for FD readiness events — but a guest writing data has no FD-side signal (the guest is the writer; the SLIRP-side socket only becomes readable when the host responds). So queued ACKs sat 50 ms before being flushed; the TCP send window stalled; voidbox-network-bench TCP g2h dropped from ~1885 Mbps to ~225 Mbps even after the mutex-contention fix.

  Fix: track inject_to_guest length around process_guest_frame's ethertype dispatch. If the call queued any frames, call epoll_waker.wake() — one byte to the non-blocking self-pipe, which unblocks net_poll_thread's epoll_wait so the queued frames flush within microseconds.

  Also fixes the related drain_to_guest event-source ordering bug: pending_events (filled by net_poll_thread) is now ALWAYS drained first, with the non-blocking epoll poll only running as a fallback when the queue is empty (test/bench paths without net_poll_thread).
  The previous code took the try_lock branch when net-poll was between iterations and silently dropped events the net-poll thread had already pushed.

  voidbox-network-bench post-fix:
  - g2h: ~6000 Mbps (vs master 1885; +3.2×)
  - bulk-g2h: ~3900 Mbps (vs master 1565; +2.5× at SO_RCVBUF=4096)
  - rr p50: 2 µs (parity with master)
  - crr p50: ~50 ms (5× regression vs master's ~10 ms — a separate bug, tracked in a follow-up; the 50 ms is exactly one epoll_wait cycle and points to a connection-establishment latency issue independent of the throughput path)

* perf(vmm): epoll_wait timeout 50 ms → 5 ms — restore CRR latency

  CRR p50 regressed by +40 ms (10 ms → 51 ms) post-Phase-6.4. The +40 ms exactly matches Linux's TCP delayed-ACK timer, and the cause is that Phase 6.4 widened the net-poll IRQ re-pulse cadence from 5 ms to 50 ms. The Linux guest spends most idle time in HLT and relies on regular vCPU scheduling slots — driven by our IRQ pulses — to advance its TCP delayed-ACK timer. At a 50 ms cadence the guest's pure ACKs ride the next event-triggered IRQ, which can be 40+ ms away. At 5 ms the housekeeping cadence mirrors pre-6.4 and the timer fires on schedule.

  We lose Phase 6.4's headline "10× idle-wakeup reduction" goal, but fast-path events still wake immediately via epoll readiness — so the net win vs master is unchanged: g2h throughput +250%, bulk throughput +250%, RR parity, CRR parity.

  voidbox-network-bench post-fix:
  - g2h: ~6500 Mbps (vs master 1885; +247%)
  - bulk-g2h: ~5400 Mbps (vs master 1565; +245%)
  - rr p50: ~3 µs (parity)
  - crr p50: ~10100 µs (parity — back to the 10 ms baseline)

* chore: remove Phase-N references from inline + doc comments

  Per AGENTS.md doc-comment style ("avoid ticket IDs and PR/commit references inside doc comments and inline comments — they belong in commit messages and PR descriptions where they're audit trail; in code they age into noise as the ticketing context evolves"). Phase references fall into the same category.
  Comments are rewritten in present tense to explain the structural reasoning without referencing when each piece landed. Identifiers like test names and BROKEN_ON_PURPOSE markers are unchanged. Plan/spec docs in docs/superpowers/plans/ are intentionally untouched — phase references there ARE the audit trail.

* perf(vmm): adaptive epoll_wait timeout (5 ms active / 50 ms idle)

  Recovers Phase 6.4's headline 10× idle-wakeup reduction without re-introducing the +40 ms CRR regression that forced the cadence back to a fixed 5 ms.

  The adaptive policy:
  - last cycle had any kernel event → next timeout 5 ms (active)
  - last cycle timed out (no events) → next timeout 50 ms (idle)

  A single quiet cycle drops us to idle; a single event puts us back in active on the next cycle.

  The subtlety that motivated the additional EpollDispatch change: when the vCPU thread calls epoll_waker.wake() during a 50 ms idle wait, the kernel's epoll_wait returns with the self-pipe event. wait_with_timeout filters that event out and drains the pipe — so `epoll_events.is_empty()` would have remained true, and the naive "is_empty ⇒ idle" predicate kept us at 50 ms forever, regressing CRR p50 back to ~50 ms. wait_with_timeout now returns the *raw* kernel count (including self-pipe wakes) so the adaptive policy treats wakes as activity. Filtered events still arrive in the out parameter unchanged; only the return value's meaning shifted from "observable count" to "raw count," which all existing callers ignore.
  voidbox-network-bench post-fix:
  - g2h: ~6680 Mbps (vs 5 ms fixed: 6500; vs master: +254%)
  - bulk-g2h: ~5550 Mbps (vs 5 ms fixed: 5400; vs master: +254%)
  - rr p50: 1 µs (in a 99-sample iteration; parity)
  - crr p50: ~10100 µs (parity preserved — the adaptive policy correctly holds the 5 ms cadence during connection bursts because each connection's wake() keeps raw_kernel_events > 0)

  Idle CPU dropped: profile-pre showed net_poll_thread on-CPU at 4.93% of total at the fixed 5 ms cadence (200 wakes/sec); adaptive should cut that roughly 10× during idle stretches between iterations.

* fix(slirp): collision-safe flow tokens via monotonic counter

  flow_token_for_tcp/udp truncated dst_ip to 16 bits; flow_token_for_icmp omitted dst_ip entirely. Multiple flows could collide on the same token, mis-routing readiness events to the wrong FlowKey.

  Replace the lossy encoding with a monotonic AtomicU64 counter per backend. Tokens are still tagged in the high byte for protocol demux (PROTO_TAG_TCP/UDP/ICMP); the lower 56 bits are unique. A new token_to_key HashMap makes the readiness → FlowKey lookup O(1) instead of the previous linear flow_table scan.

* fix(slirp): EpollDispatch lock-free — register/unregister never block

  net_poll_thread held the Mutex<EpollDispatch> across the blocking epoll_wait call (up to 50 ms in the idle cadence). vCPU register/unregister paths in handle_tcp_frame (and friends) had to acquire the same mutex and would block behind the wait, stalling guest TCP SYN handling for up to 50 ms during connection setup.

  epoll_ctl and epoll_wait are kernel-thread-safe on the same epoll fd; the only state requiring synchronization was the self-pipe (now eagerly initialized in EpollDispatch::new) and the registered fd count (now an AtomicUsize). EpollDispatch becomes Sync without an external Mutex — the type changes from Arc<Mutex<EpollDispatch>> to Arc<EpollDispatch>. register/unregister run lock-free against the wait thread; only the kernel's per-epoll-fd internal lock serializes, and that's a fast path.
* perf(slirp): O(1) dedup in idle-timeout sweep via HashSet

  to_remove.contains() inside the idle-timeout loop was O(n*k) under churn. Switch the membership check to a HashSet<FlowKey> and only materialize the Vec once, at the end, for the removal loop.

* docs(vmm): net_poll_thread doc-comment matches adaptive timeout

* style: rust-style sweep on Phase 6.4 / Copilot-fix code

  Apply project rust-style rules to the recently-landed Phase 6.4 code (token rewrite + lock-free EpollDispatch refactor):

  1. A RegisterMode enum replaces (readable: bool, writable: bool) on EpollDispatch::register. Closed-set policy at the call site (Read / Write / ReadWrite) over two opaque booleans.
  2. matches!() removed at three sites — the project guide prefers a full match (or boolean ==) for compiler diagnostics if the matched type changes. The unprivileged-ICMP errno check now uses == comparisons; the FlowKey::Tcp counter uses a for loop.
  3. Iterator chains in the relay loops rewritten as for loops with mutable accumulators, per the project rule. relay_tcp_nat_data, relay_udp_flows, and relay_icmp_echo all touched. Logic unchanged; control flow now reads top-down without a chain of .filter().filter_map().collect().
  4. Local renamed `rc` → `epoll_ctl_result` at three EpollDispatch sites. Role-bearing names are required in non-tiny scopes.
  5. Dropped redundant explanatory comments around the relay loops ("Data relay — only for flows with…", "Skip entries already queued for…", "Collect ready ICMP flow keys via…"). The code below them is self-describing. Kept structural "why" comments (the ICMP idle-sweep rationale, the per-flow socket Drop contract).

  No behavior change. cargo fmt, clippy -D warnings, network_baseline (18/18), lib network (23/23), and the voidbox-network-bench wall-clock run (g2h ~6580 Mbps, CRR ~32 µs) all green.
Status: Ready for review. Phases 0–5 + 5.5b of `docs/superpowers/plans/2026-04-27-smoltcp-passt-port.md` are complete on this branch. Phase 6 (TCP half-close, async connect, window mgmt, event-driven polling) is scoped as an overview plan in `docs/superpowers/plans/2026-04-30-smoltcp-passt-port-phase6.md` and ships as separate PRs.

Supersedes the design in `docs/superpowers/plans/2026-04-12-network-backend-abstraction.md` — instead of adopting `passt` as an external backend, this branch lifts its design patterns into our existing all-Rust smoltcp-based SLIRP stack. It keeps the all-Rust differentiator (full observability via `tracing`, the debugger works, no opaque process boundary) while picking up passt's correctness wins (ICMP, full UDP, sequence-mirroring TCP, port-forwarding, stateless NAT).

## What's in this branch
### Phase 0 — Baseline + `NetworkBackend` trait

- `tests/network_baseline.rs` — unit-level pins driving `SlirpBackend` directly (no VM), including BROKEN_ON_PURPOSE assertions that flip in later phases.
- `benches/network.rs` — divan microbenches for SLIRP hot paths.
- `voidbox-network-bench` — wall-clock harness with metric names matching passt's published table.
- `NetworkBackend` trait extracted; `SlirpStack` → `SlirpBackend` (role-based name). The `Vec<Vec<u8>>` allocation dropped (`drain_to_guest(&mut Vec)` out-param).

### Phase 1 — ICMP echo
`SOCK_DGRAM IPPROTO_ICMP` host socket (unprivileged). Sysctl-restricted hosts get a warn-once + drop fallback. BROKEN_ON_PURPOSE pin flipped: `icmp_echo_returns_reply` positively asserts the round-trip.

### Phase 2 — Generalized UDP
### Phase 3 — TCP relay rewrite
`recv(MSG_PEEK)` for host→guest, no userspace buffer. ACK-driven kernel consume. guest→host now uses TCP backpressure (don't-ACK-on-EAGAIN) instead of a 256 KB userspace cap. `to_guest`/`to_host`/`to_host_pending_ack`/`MAX_TO_HOST_BUFFER` all deleted. BROKEN_ON_PURPOSE pin flipped: `tcp_writes_more_than_256kb_succeed` pushes 1 MB through the relay.

### Phase 4 — Unified flow table
`FlowKey`/`FlowEntry` enums collapse the parallel `tcp_nat`, `udp_flows`, and `icmp_echo` tables into a single `HashMap<FlowKey, FlowEntry>`. Each protocol's relay loop iterates the unified table with a `matches!` filter. Establishes the substrate for Phase 5+ stateless NAT and Phase 6 epoll dispatch.

### Phase 5 — Stateless NAT translation refactor
`src/network/nat.rs` with `translate_outbound(rules, dst, port, gateway)` — a pure function: no state, no allocation in the hot path. `Rules` carries `gateway_loopback`, `deny_cidrs`, and `port_forwards`. The TCP and UDP guest-frame handlers consume it through the same code path. The `deny_list` field is dropped from `SlirpBackend` (subsumed by `nat::Rules.deny_cidrs`).

### Phase 5.5b — Inbound port-forwarding
`SlirpBackend::with_security` spawns one host listener thread per TCP port-forward rule. On accept, the listener pushes through an mpsc channel; `process_pending_inbound_accepts` (called at the top of `drain_to_guest`) seeds a `SynSent` flow-table entry and synthesizes the inbound SYN to the guest. A new `TcpNatState::SynSent` handles the resulting SYN-ACK transition to `Established`. `port_forward_accept_latency` has a ~50 ms ceiling, bounded by `PORT_FORWARD_POLL_INTERVAL`.

### Bench harness + comparison
- `cargo bench --bench network --features bench-helpers` — divan microbenches.
- `voidbox-network-bench --output report.json [--bulk-mb N]` — wall-clock harness, JSON report.
- `scripts/bench-compare.sh --baseline <ref> [--skip-vm] [--skip-divan]` — markdown comparison of HEAD vs any baseline ref. Falls back gracefully when the baseline ref pre-dates `bench-helpers` (Phase 5.5b).

## Reviewer-driven fixes (this PR)
Bundled correctness improvements from the Copilot review on this PR:
- `icmp_echo_returns_reply` now probes `socket(AF_INET, SOCK_DGRAM, IPPROTO_ICMP)` once and asserts on the round-trip; it only skips on EPERM/EACCES (sysctl restriction).
- `measure_tcp_throughput_g2h` and `measure_bulk_throughput_g2h` now `continue` on guest non-zero exit so failed iterations don't skew the mean.
- `voidbox-network-bench` accepts bound with `accept_with_deadline` so spawned echo/drain threads exit cleanly when the guest never connects.
- `measure_icmp_rr_latency` dropped its guest-side `ping` invocation (the test/production images intentionally omit `/bin/ping`); the metric is now an explicit `None` with a tracing warning until a host-driven measurement path lands in Phase 6.
- `benches/network.rs` migrated from the deprecated `SlirpBackend::poll()` (per-call `Vec<Vec<u8>>` alloc) to `drain_to_guest(&mut Vec<Vec<u8>>)` with a reused buffer; `#![allow(deprecated)]` removed.

## Bonus fixes (caught while debugging)
- (77dfc67): `chmod u+s` on the user-packed cpio dropped the PID 1 euid to uid 1000, breaking `CAP_NET_ADMIN`-needing `setup_network`. Reverted.
- (0d0ab20): `voidbox-startup-bench`'s `capture_snapshot` was missing `.enable_snapshots(true)`, so cold-boot used `Vhost` and warm-restore (always `Userspace`) failed. Fixed; `continue-on-error: true` removed from the bench step (it was masking this on `main`).

## Non-negotiable invariants (preserved through every phase)
- `tracing` — every state transition emits a structured event.
- `cargo test`-driveable — every behavior change exercised by `tests/network_baseline.rs` without a VM.

## CI changes
- `voidbox-network-bench --iterations 3` step in `startup-bench.yml`, no `continue-on-error`.
- `continue-on-error: true` removed from the `voidbox-startup-bench` wall-clock step.

## Validation
- `cargo fmt --all -- --check`
- `cargo clippy --workspace --all-targets --all-features -- -D warnings`
- `cargo test --workspace --all-features`
- `cargo test --test network_baseline`
- `cargo bench --bench network --features bench-helpers`
- `cargo test --test snapshot_integration -- --ignored`
- `cargo test --test e2e_mount -- --ignored`
- `voidbox-network-bench --iterations 3`
- `voidbox-startup-bench --iters 3 --breakdown`
## Bench comparison

`scripts/bench-compare.sh --baseline <Phase-0B.1-tip> --skip-vm` (the earliest commit that has the bench file) compares the pre-existing microbenches (`process_syn`, `process_arp_request`, `poll_idle`, `drain_to_guest`) against the benches that are new on this branch and have no baseline counterpart (`tcp_bulk_throughput_1mb`, `port_forward_accept_latency`, `nat_translate_outbound_hot_path`, `process_udp_frame`, `process_icmp_echo_request`, …).

The work's wins are primarily capability (ICMP/UDP/inbound port-forwarding work where they didn't before; the 256 KB TCP cliff is gone) and architecture (unified flow table, stateless NAT). Raw guest→host throughput is roughly flat (~1885 Mbps), which is expected — passt-style refactoring buys correctness and observability, not headline bytes/sec.
## What's NOT in this branch (deferred)
- `docs/superpowers/plans/2026-04-30-smoltcp-passt-port-phase6.md` — overview of TCP half-close, async connect, window mgmt, event-driven polling. Per-subsystem TDD plans are deferred until each kicks off.
- `conformance` failures — 3 tests fail (they use `/tmp` paths rejected by PR "hardening: openat2-based symlink-escape guard for privileged FS RPCs (R-B2.1 + R-B2.2)" #56's `fs_guard`); confirmed pre-existing on `main`, not Phase X regressions.

## Test plan for review
- `cargo test --test network_baseline` — pins pass; the three flipped BROKEN_ON_PURPOSE pins are now positive assertions.
- `cargo bench --bench network --features bench-helpers tcp_bulk_throughput_1mb` — should produce a measurement.
- `voidbox-network-bench --iterations 3 --bulk-mb 10` — both throughput numbers (constrained vs unconstrained) without errors.
- `voidbox-startup-bench --iters 3 --breakdown` — warm phase exits 0.
- `scripts/bench-compare.sh --baseline 41c8382 --skip-vm` — validates the comparison harness.
- Read `docs/superpowers/plans/2026-04-27-smoltcp-passt-port.md` and `2026-04-30-smoltcp-passt-port-phase6.md` for the design rationale and follow-up scope.